Abstract

Consider a transmission scheme with a single transmitter and multiple receivers over a faulty 
broadcast channel. For each receiver, the transmitter has a unique infinite stream of packets, and its goal
is to deliver them at the highest throughput possible. While such multiple-unicast models are unsolved 
in general, several network coding based schemes were suggested. In such schemes, the transmitter can 
either send an uncoded packet, or a coded packet which is a function of a few packets. The packets 
sent can be received by the designated receiver (with some probability) or heard and stored by other 
receivers. Two functional modes are considered; the first presumes that the storage time is unlimited, 
while in the second it is limited by a given Time to Expire (TTE) parameter. 

We model the transmission process as an infinite-horizon Markov Decision Process (MDP). Since 
the large state space renders exact solutions computationally impractical, we introduce policy restricted 
and induced MDPs with significantly reduced state space, and prove that with proper reward function 
they have equal optimal value function (hence equal optimal throughput). We then derive a reinforcement 
learning algorithm, which learns the optimal policy for the induced MDP. This optimal strategy of the 
induced MDP, once applied to the policy restricted one, significantly improves over uncoded schemes. 
Next, we enhance the algorithm by means of analysis of the structural properties of the resulting reward 
functional. We demonstrate that our method scales well in the number of users, and automatically 
adapts to the packet loss rates, unknown in advance. In addition, the performance is compared to the 
recent bound by Wang, which assumes much stronger coding (e.g., intra-session and buffering of coded 
packets), yet is shown to be comparable. 


Parts of this work will appear at the IEEE International Symposium on Information Theory, ISIT 2015, Hong Kong. 
I. Introduction

Typical wireless access architectures constitute a gateway, or an Access Point (AP), to which all
nearby clients are connected by means of a wireless medium. Among the prominent examples of
such an architecture is the prevailing IEEE 802.11 or LTE infrastructure mode setting. The downlink
traffic implied in such a topology comprises an AP sending (usually independent) traffic streams to
the corresponding users. Furthermore, common wireless standards incorporate reliability
mechanisms in order to overcome the inherently poor qualities of the radio channel. For example,
IEEE 802.11, like many other network protocols, attains reliability through retransmission.

Network coding [1] refers to the transmission of predefined functions (usually a linear
combination) of packets in order to achieve higher throughput, error correction and better security.
Wireless communication, and in particular transmission over the wireless channel, which is
broadcast in nature and hence can potentially be heard by non-addressees of the dedicated stream,
is a natural platform for network coding. Nonetheless, in order for such a mechanism to be
effective, the overhearing users need to store the relevant parts of the traffic streams even when
they are not the intended addressee.

In this work, we address the aforementioned scenario of a single AP sending unicast streams
to K corresponding listeners. We assume that all streams are fully backlogged, i.e., there is
a packet pending for each receiver at all times (infinite horizon). We also assume a typical
stop-and-wait ARQ (automatic repeat-request) mechanism, similar to the one adopted by the IEEE
802.11 standard. In such schemes, a sender sends one frame at a time, where each frame is sent
repeatedly until the sender receives an acknowledgment (ACK) frame from the receiver. That is,
the next packet to some user is transmitted only after the previous packet to that user was
received correctly. We adopt the decoding and data storage pattern known in the literature as instantly
decodable network coding [2]. Specifically, each user stores packets even if they are not destined to it,
yet only uncoded packets are stored at the receivers while coded combinations are discarded.
We assume that the data stored at the listeners is known to the AP at all times; this can be
achieved by each receiver piggybacking a list of its currently stored packets not destined to it on
the user's upstream traffic (each DATA or ACK sent by the user to the AP).

Using network coding at the AP, the challenge in each downstream transmission is to determine
whether to send an ordinary unicast packet to one of the intended receivers, or to send a linear
combination of packets. Note that even under this seemingly moderate setup, in which users
store only uncoded packets and the AP has at most a single packet pending per user at a
time, each user can potentially store a packet for each other user (i.e., $2^{K-1}$ possibilities
per user, where K stands for the number of users). Hence, the number of different options for stored
packets before each transmission opportunity (termed the state space) is enormous.
Consequently, no efficient solution optimal in the general case exists [3].

In this paper, we design a computationally feasible, scalable and robust methodology which
effectively addresses the aforementioned problem. Furthermore, in addition to the generic problem
described above, we also consider a more complicated setup in which the storage time of packets
at the receivers is limited by a Time to Expire (TTE) constraint, i.e., a packet whose storage time
has expired is invalidated and discarded. We present a theoretical framework and a model-based
learning implementation which allow us to acquire the on-line transmission and retransmission
policy under such channel conditions. In particular, we address three specific challenges. First, the
fundamental challenge of network coding: deciding what is the most effective linear combination
of the data to be transmitted. This problem becomes further complicated once TTE constraints
are introduced. Second, in contrast to most known works, our model presumes infinite data
streams for all listeners, rather than limiting the amount of data to a fixed block. Finally, the
encoding decisions are made in an environment without prior knowledge of the packet loss
probability. As we elaborate in the related work section, previous works in the area mainly
considered various optimization problems for multicast transmissions and/or finite horizons
(finite block length). However, this is the first work to address all these challenges in a unified
framework.

Our main contributions are as follows. We model the transmission process by a Markov
decision process (MDP). Since the original state space is intractable, we utilize state
aggregation. State aggregation (sometimes referred to as state abstraction) is a technique to partition
the state space such that all states belonging to the same partition subset are aggregated into
one meta-state, and the same policy applies to all states in the meta-state. In contrast
to a complex exhaustive search to find the optimal aggregation, we force a state aggregation
based on proven coding concepts. We further introduce a policy restricted MDP and an induced
MDP which undergoes a dramatic state space reduction, and show that if one chooses the
appropriate reward function for the induced MDP, the overall reward of both processes will be
equal. Specifically, instead of keeping track of all possible packets (coded and uncoded), we
only keep track of two state variables: (i) The size of the maximal group of users in which each
member of the group has a packet destined to each other user in the group but its own (i.e., the
maximal clique; accordingly, in the sequel we refer to any set of users each having the packets
of all other users in the set as a clique, and to the largest such set as the maximal clique). Note that for each
clique, a single coded packet which linearly combines all the packets destined to the users in
the clique can be sent, and each user receiving the coded packet can decode its own packet. (ii)
The number of users whose packets are not stored by any other user. Note that this abstraction
allows us to significantly reduce the state space from $O(2^{K^2})$ to $O(K^2)$. Consequently, we also
restrict the action space, such that the only allowed actions are transmitting a packet to one of
the users currently not having its packet stored at any other user, or transmitting a coded
packet to the maximal clique. Hence, we name the MDP which only allows restricted actions
based on the aggregation a policy restricted MDP, and the MDP which sees only aggregated
states an induced MDP.
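
The two aggregated state variables can be illustrated with a minimal Python sketch. It assumes the detailed state is a K x K 0/1 matrix `s` in which `s[i][j] == 1` means that user j stores the packet destined to user i. The function name `aggregate`, the brute-force clique search (exponential in K, so only suitable for small examples) and the convention that a single user counts as a clique of size 1 are assumptions of this sketch, not part of the scheme itself.

```python
from itertools import combinations


def aggregate(s):
    """Map a detailed K x K 0/1 state to (max_clique_size, n_unstored).

    s[i][j] == 1 means user j stores the packet destined to user i.
    Assumption of this sketch: a single user is a trivial clique of size 1.
    """
    K = len(s)
    # Users whose pending packet is not stored by any other user.
    n_unstored = sum(
        1 for i in range(K) if not any(s[i][j] for j in range(K) if j != i)
    )
    # Search from the largest candidate group downwards (brute force).
    for size in range(K, 1, -1):
        for group in combinations(range(K), size):
            # Every member must hold the packets of all other members.
            if all(s[i][j] for i in group for j in group if i != j):
                return size, n_unstored
    return (1 if K > 0 else 0), n_unstored
```

For example, with three users where users 0 and 1 hold each other's packets and nobody holds the packet of user 2, the sketch returns a maximal clique of size 2 and one unstored user.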

Given the transition probabilities, the optimal policy can be read off the Bellman equation 
for the induced MDP, which has a relatively small state space and thus can be efficiently 
solved. However, since the transition probabilities are hard to calculate, we learn them using a 
model-based learning algorithm. Namely, we derive a novel on-line explore and exploit learning 
algorithm, which iterates between the learning phase and the Bellman equation solution phase 
in our problem. Hence, we achieve the optimal policy, which, in turn, results in the optimal 
throughput (under the constraints imposed by the aggregation and state reduction). Note that 
this approach is independent of the channel conditions, and works equally effectively for any 
packet loss, including when the packet loss is not stable and fluctuates around some value. We 
also study the structural properties of the value function, and use these properties to both gain 
deep understanding on the behavior of optimal policies and accelerate the reinforcement learning 
(RL) procedure. Specifically, we prove that under mild conditions, there exists a "threshold type
policy", namely as a function of the maximal clique size, there is only one transition from one
optimal action to the other, and once sending a clique is optimal, it continues to be optimal for
all larger cliques. We show that our algorithm is both computationally tractable and scalable.
At the same time, its performance is comparable to the upper bounds in [4], which are given
for a much stronger coding scheme, including intra-session coding, much larger state space and
buffers, and no TTE.
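
As an illustration of the Bellman solution phase on a small state space, the following is a minimal sketch of standard discounted value iteration. It is not the learning algorithm of this paper: the transition model is assumed to be already known (or estimated), and the data layout `P[s][a] = [(prob, next_state, reward), ...]`, the function name and the default parameters are hypothetical choices made for the example.

```python
def value_iteration(P, gamma=0.9, tol=1e-9, max_iter=10_000):
    """Standard value iteration for a small discounted MDP.

    P is a list indexed by state; P[s] maps each action to a list of
    (probability, next_state, reward) outcomes (hypothetical layout).
    Returns the value function and a greedy policy.
    """
    def q(s, a, V):
        # Expected one-step reward plus discounted value of the next state.
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    V = [0.0] * len(P)
    for _ in range(max_iter):
        V_new = [max(q(s, a, V) for a in P[s]) for s in range(len(P))]
        converged = max(abs(x - y) for x, y in zip(V, V_new)) < tol
        V = V_new
        if converged:
            break
    policy = [max(P[s], key=lambda a: q(s, a, V)) for s in range(len(P))]
    return V, policy
```

In a two-state toy model where action "b" forgoes an immediate reward in order to reach a more rewarding state, the greedy policy returned by the sketch prefers "b", which mirrors the idea of transmitting so as to reach states with a higher potential value.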

We incorporate the TTE constraint within the aforementioned MDP model and propose two
types of state aggregations. We compare our algorithms with known algorithms in the literature
via extensive simulations.

A. Related work 

Network coding. While the problem of NC has been widely treated in the multicast setting,
multiple unicast still provides a rich ground for ongoing research. Coded retransmissions were
considered in [5], where, after sending a finite set of packets to all users and receiving
acknowledgements, coded retransmissions are calculated and sent in order to complete the missing
packets. Hence, this is a finite horizon problem, where a block is sent only when the previous
one is completely decoded. [?] continued the above work, seeking to maximize the coding
opportunities. Similar to our problem, in [2] users cannot store coded packets. However, [?]
fits a multicast scenario rather than multiple unicast. Moreover, the graph required to identify
cliques in [3] grows with the stream size, while it is fixed in our scheme. Finite streams and
clique structures were also addressed in [?]. Additional strategies for finite streams can be found
in [3], [?] and [?].

In [10], the objective was to minimize the delay using random linear NC. Random NC was
also applied for mesh networks in [11]. The finite horizon work [12] minimized the delay by
linear programming. Network coding for multi-hop wireless networks was addressed in [13]. To
the best of our knowledge, no previous work analytically treated the setting where the storage
time of the side information is limited by some parameter (TTE). Practical insights on storage
time constraints and imperfect acknowledgement delivery are given in [14]. We also mention the
MDP-based approach for perfect feedback [15] and the partially observable MDP for uncertain
feedback [16]. Both works, however, are for a finite horizon and do not include state aggregation.
Thus, the problem of scalability of the solutions with the size of the stream is raised.

Recently, the seminal work in [4] gave codes and bounds for the erasure broadcast channel.
The coding strategy therein was proved optimal for up to 3 users, and bounds were given for
general K (two users were considered earlier in [17]). The coding scheme therein assumed that more
than one packet per user can be coded and overheard (intra-session coding), while we only allow
transmitting the first packet per session. Furthermore, the model in [4] allows storing coded
packets, at the price of larger buffers and state space, while our model assumes instantly decodable
codes. Nevertheless, we use the theoretical upper bound in [4] to evaluate the performance of
the schemes suggested herein, and find them comparable despite the much simpler coding in
this work. Note also that calculating the regions in [4] is exponentially complex in K, while the
algorithms suggested herein scale well with the number of users.

To conclude, none of the aforementioned works addressed the problem of multiple unicast with
infinite horizon addressed in this paper. Reference [18] attempted to provide heuristic algorithms
for a small number of users, yet the algorithms therein show inferior performance compared to
the learning-based solutions suggested in this work. In addition, [18] did not consider the channel
condition, while our approach adjusts to the packet loss uncertainty.

If random linear network coding (RLNC) (e.g., [19]) is used only across flows (only inter-flow
coding), then, regardless of the field size used, such a coding scheme will effectively require
all receivers to decode all the data, which is highly inefficient. Increasing the field size will
only increase the probability that a sent packet is independent of the previously sent ones, but
would still require each receiver to wait for a full rank on all the data in the system. Moreover,
RLNC requires receivers to cache coded packets as well. Indeed, it is well known in the coding
literature that RLNC is optimal for multicast (all receivers requiring all the information), yet
highly inefficient for multiple unicast, which is the problem at hand.

Finally, note that Wang's bound, discussed and depicted in Section VI, allows for the
most general coding schemes, including larger window size, buffering of coded packets,
intra-flow coding and high field sizes. Thus, our results are compared to the most general (and
computationally expensive) coding scheme, and show good performance.

Index Coding and ARQ. The relation between NC and Index Coding (IC) [20] was formulated
in [21]. The most general formulation of the IC problem constitutes a setting of K nodes, each
having a set of packets as side information and expecting an optionally distinct set of packets.
At the beginning of the communication, all the data is at the base station, and the goal is to
find a transmission strategy to satisfy all demands. Therefore, this is, in essence, a finite horizon
problem. Of course, similar to previous works, IC, in general, allows for complex coding over all
packets in the block and storing of coded packets at the receivers before decoding. In addition,
reference [22] treated IC with side information which includes coded packets as well. Note that
we do not use the classical formulation of these problems since we do not address decoding of
finite blocks but rather take the infinite horizon view of the problem.

Minimization of the overall transmission time was addressed in [23]. The policy described
in [23], if considered on a per-node basis, results in a greedy algorithm, maximizing the
information gained from a single transmission. In the MDP-based approach herein, however, the
transmission policy accounts for the ability to transit to more rewarding states in the future,
hence generalizes the greedy approach.

Index coding in a scenario where each packet should be transmitted to all users was compared
to an ARQ scheme in [24]. It was shown that as the number of users K grows, the number of
transmissions with NC is constant, while it is logarithmic in K in the case of ARQ. ARQ schemes
were also analyzed in [25] and implemented in [26], where the authors considered a broadcast
network and the queue size at the sender side as the primary performance metric. As for unicast
scenarios, the finite horizon scheme [27] optimized the number of decoding operations, rather
than the number of transmissions.

It is important to note that there are a few critical differences between the state of the art in
index coding and the coding scheme suggested in this paper. First, index coding considers only
finite horizon scenarios, i.e., each receiver is interested in a fixed, finite list of packets, and one
has to devise, before communication starts, the best coding scheme in terms of minimizing the
number of packets required to satisfy all demands. In our problem, users have infinite streams, the
state of the system (in terms of the side information available) changes after each transmission,
and one has to make coding decisions after each transmission. Second, the state of the art index
codes are not instantly decodable, namely, receivers might need to wait for the end of the block
to decode their data. The scheme herein is instantly decodable. Finally, index coding allows
the receivers' demands to partially overlap, hence is more general in this sense. Yet, it is well
known to be a hard problem (e.g., [?]), with no efficient solutions in the general case. Thus,
it is beneficial to consider different settings, in which high gains can be efficiently achieved.

State aggregation. For a road-map paper on state aggregation methods see [28]. That work
defined 5 abstraction methods, of which the most relevant to our setting is $\pi^*$-abstraction. We
partially adopt their definitions of aggregated and detailed (ground) states and the corresponding
abstraction function. $\pi^*$-abstraction can be suboptimal compared to the original MDP [29].
However, our approach is different from [28], since we do not attempt to perform a search to find
the aggregation which would preserve optimality, but rather, based on key principles in coding
and retransmission, define a robust MDP abstraction, in order to acquire the smallest state space
and action space. An adaptive aggregation for the average reward MDP was presented in [30].
In that work, the aggregation is generic and the partition into aggregated states is updated
while the algorithm runs. However, it is not clear how to predict the number of states in
such an aggregation once the algorithm has achieved the desired optimality bound. Our aggregation
is fixed and predefined in order to specifically suit the given communication problem. Hence,
both the aggregation and the state-space size we employ are predefined and result in a much
simpler RL algorithm, at the expense of optimality guarantees. Another survey work on abstraction,
in the context of reinforcement learning, is [31]. State aggregation for continuous MDPs is presented
in [32]. The authors in [33] proposed a near-optimal reinforcement learning algorithm aiming
to asymptotically achieve the optimality of the original MDP. However, the running time
needed to achieve the desired optimality gap is not feasible for our purpose.

II. Model description 

We consider a downlink wireless model, with one transmitter (access point) and K receivers.
At the sender, we assume an infinite stream of packets for each user (i.e., unicast traffic). We
assume a Stop-and-Wait based protocol. Accordingly, even though the sender has an infinite set of
packets per receiver, we assume only one such packet is active at a given time per receiver, i.e.,
the sender does not transmit new packets for a receiver until the active one is received correctly
and acknowledged. Note that this mechanism conforms to the widely deployed IEEE 802.11
protocol suite. Our channel model assumes the packet sent at each slot is received at receiver k
with probability $p_k$, independently of the other receivers and of the previously received packets
(memoryless independent users). The packet loss probabilities are assumed to be fixed in time.
We assume that uncoded packets correctly received by a receiver which is not the intended one
are cached. Note that, on top of the coding scheme we suggest, off-the-shelf error correction
codes can be utilized in order to improve $p_k$ at the expense of overhead.

We assume that packets overheard by undesignated users can be stored for future use. Yet, we
assume that only uncoded packets can be stored at the receivers, while coded or corrupted packets
are discarded. We distinguish between two cases, unlimited storage time and limited storage time.
We first treat the case where the stored packets are never outdated (i.e., storage time is unlimited).
Denote by $\mathcal{M}$ the space of $K \times K$ binary matrices, where each $s \in \mathcal{M}$ represents a possible
state. In particular, each row $i \in \{1, \dots, K\}$ constitutes a vector of indicators such that $s_{i,j} = 1$
if and only if user j has a packet designated for user i. We assume the AP is always aware of the
data kept by the receivers, using status updates sent by each receiver. We assume that when a
receiver overhears or decodes a packet destined to another, it is able to store it. The state of the
system is updated after every transmission slot. At transmission slot t the state is represented by
$s(t) \in \mathcal{M}$. In the case that user k successfully decodes its packet, $s_{k,i} = 0, \forall i$ is set. Setting the
entire row k to zero is motivated by the simple reasoning that users that stored the packet prior
to the successful transmission can now discard it. The sender can now send the next packet for
that user. In the case that the destination fails to receive its packet, we set $s_{k,k'} = 1$ if the packet
is heard by user $k'$ and $s_{k,k'} = 0$, $k \neq k'$, otherwise.

Next, we consider the case of limited storage time, in which the time a packet can be stored at each
receiver's buffer is bounded; we refer to the number of time slots a packet can be stored as the Time to Expire
(TTE). Accordingly, a packet overheard by a non-intended receiver which is stored for
more than its maximal validity time is invalidated and discarded. For simplicity, we assume a
system of identical users, i.e., all packets have the same TTE limit, which we denote by T, i.e.,
the maximal time a packet can be stored is T time slots. Respectively, each transmitted packet
has a TTE associated with it. This value is updated every time slot, until the packet is correctly
decoded or outdated and dropped. We denote the TTE of a stored packet, at some given time
slot, by $\tau \in \{1, \dots, T\}$, and by $\tau = 0$ the case that no valid packet is stored. Every time slot,
for every packet stored by a user, $\tau$ is decremented by 1. Once $\tau$ becomes 0, the corresponding
packet is outdated and dropped.

In this case, each state s is a $K \times K$ matrix
of TTE values associated with undecoded packets held by the receivers. In particular, each row
$i \in \{1, \dots, K\}$ constitutes a vector of TTE parameters, such that $s_{i,j} = \tau$ if and only if user j
has a packet destined to user i, and there are $\tau$ time slots left till the packet expires. Similarly
to the scenario without a TTE constraint, we assume that the AP is always aware of what data is
kept by which receivers. Whenever the intended receiver fails to receive its packet, the AP sets
$s_{k,k'} = T$ if the packet is either heard by user $k'$, or user $k'$ already has this packet stored, and
sets $s_{k,k'} = 0$, $k \neq k'$, otherwise. Hence, all users that overheard some packet have an equal
value stored for its current TTE. This value is stored at the AP and is used for the transmission
decisions.
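
The TTE bookkeeping for an uncoded transmission can be sketched as follows. This is a minimal illustration under stated assumptions: the state is a K x K integer matrix `s` where `s[i][j]` is the remaining TTE of the packet destined to user i and stored by user j (0 meaning not stored), `received_by` is the set of users that received the transmission in this slot, and all stored entries are aged by one slot before the row of the transmitted packet is updated. The function name and the order of ageing and update are assumptions of the sketch; the text above does not fix them.

```python
def update_tte_state(s, k, received_by, T):
    """Next TTE state after an uncoded transmission of user k's packet.

    s[i][j] is the remaining TTE of the packet destined to user i that is
    stored by user j (0 = not stored). Assumption of this sketch: every
    stored entry is first aged by one slot, then row k is updated.
    """
    K = len(s)
    # Age all stored packets; an entry reaching 0 is expired and dropped.
    nxt = [[max(v - 1, 0) for v in row] for row in s]
    if k in received_by:
        # Delivered: all stored copies of this packet can be discarded.
        nxt[k] = [0] * K
    else:
        for j in range(K):
            if j == k:
                continue
            # Fresh TTE if user j just overheard it or already stored it.
            heard_or_stored = j in received_by or s[k][j] > 0
            nxt[k][j] = T if heard_or_stored else 0
    return nxt
```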

Each packet is represented as m symbols over the field $\mathbb{F}_{2^k}$. Thus, its payload consists of mk
bits. Now, each time a packet is sent, the sender has a few options as to which type of packet
to send. These "options" constitute its action space. Specifically, it can either choose a single
packet from the stream intended to a specific user, and send that packet to that user (termed an
uncoded packet), or, alternatively, it can code together a few packets. In this work, we use
standard linear network coding [?]; however, since nodes do not store coded packets, and we
require instant decodability, coding is done over the binary field. Thus, at every transmission
slot, the AP encodes

$z = \alpha_1 d_1 \oplus \alpha_2 d_2 \oplus \cdots \oplus \alpha_K d_K$    (1)

and sends this packet, where for each k, $\alpha_k \in \{0,1\}$, $d_i$ denotes the packet currently expected
by user i and $\oplus$ denotes bitwise XOR. Namely, the AP decides on coefficients $\alpha_k \in \{0,1\}$,
where $\alpha_k = 1$ means a packet for user k participates in the current coded transmission slot.
Otherwise, $\alpha_k = 0$. Note that choosing $\alpha_k = 1$ for only one user is equivalent to transmitting an
uncoded dedicated packet to user k. Hence, the action space is of size $2^K - 1$, and it includes all
possibilities of uncoded and coded packets (excluding the zero packet). Recall that, as previously
explained, only uncoded packets can be stored by undesignated receivers. Note that packets
to be combined (coded) are assumed to have the same size (if not, the shorter ones are padded
with trailing 0s).
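
The binary coding step and its instant decoding can be illustrated with a short Python sketch. Packets are modelled as `bytes` objects, shorter packets are padded with trailing zero bytes before XOR, and a receiver recovers its own packet by XORing the coded packet with the other participating packets it has stored. The helper names `xor_bytes`, `encode` and `decode` are hypothetical and chosen only for this example.

```python
def xor_bytes(a, b):
    """Bitwise XOR of two packets; the shorter one is zero-padded at the end."""
    n = max(len(a), len(b))
    a = a.ljust(n, b"\x00")
    b = b.ljust(n, b"\x00")
    return bytes(x ^ y for x, y in zip(a, b))


def encode(packets, alpha):
    """XOR together the packets whose binary coefficient in alpha is 1."""
    z = b""
    for d, a in zip(packets, alpha):
        if a:
            z = xor_bytes(z, d)
    return z


def decode(z, stored):
    """Recover one packet from z by XORing out all other participating packets."""
    for d in stored:
        z = xor_bytes(z, d)
    return z
```

Decoding succeeds only if the receiver holds every other packet in the combination, which is exactly the clique condition; note that the recovered packet carries the trailing zero padding when the combined packets differ in length.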

The setting described above can be seen as a framework including a state space, an action
space which comprises the possible packet combinations the AP can send at any given time
slot (denoting the action at transmission slot t by $a(t)$), and the transition probabilities. Due
to the Markov property, we deduce that the problem can be formulated as an MDP, with the
objective to maximize the transmission throughput. Hence, we define an appropriate stochastic
reward $r(s(t+1), a(t), s(t))$, associated with transitioning from state $s(t)$ to state $s(t+1)$ after
taking the action $a(t)$, such that a positive reward is accumulated for each successfully decoded
packet. For example, if a coded packet of n packets is sent, and $m \le n$ of them are successfully
decoded by their intended receivers, we have $r(s(t+1), a(t), s(t)) = m$. Failing to decode gives
no reward. Storing a packet at a receiver which is not the addressee gives no reward. However,
note that it may increase the potential number of packets decoded in the future (that is, a transition
to a state with a higher potential value).
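
The reward of a single transmission can be sketched in a few lines of Python. The sketch assumes a 0/1 state matrix `s` with `s[i][j] == 1` when user j stores the packet destined to user i, a binary coefficient vector `alpha` selecting the packets in the combination, and a set `received_by` of users that received the transmission in this slot. The function name `reward` and this argument layout are assumptions made for illustration.

```python
def reward(s, alpha, received_by):
    """Number of packets decoded by their intended receivers in one slot.

    A participating user i decodes its packet if it received the coded
    packet and stores the packets of all other participating users.
    """
    chosen = [i for i in range(len(s)) if alpha[i]]
    decoded = 0
    for i in chosen:
        if i not in received_by:
            continue  # transmission lost at user i
        if all(s[j][i] for j in chosen if j != i):
            decoded += 1
    return decoded
```

For an uncoded packet the decoding condition is vacuous, so the reward is 1 exactly when the intended receiver hears the transmission.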

We assume that the same transmission effort is required by the AP whether it transmits 
an uncoded packet, a coded one or does not transmit at all, i.e., fixed transmission costs are 
assumed. Consequently, abstention from sending a packet at any transmission slot is the worst 
option possible. Hence, at each time slot exactly one packet is sent. The objective is to find a
policy which maximizes the attained throughput, measured in successfully decoded packets per time slot.

In the next section, we present the technical definitions of the MDP and state aggregation, in
order to utilize them for the described model. For general definitions and theory of MDPs the reader
is referred to [34].


III. MDP WITH RESTRICTED ACTION SPACE AND INDUCED MDP 

In this section, we introduce the general notation which lays the ground for the state
aggregation. We follow the concepts of abstract MDPs in [?], yet adjust our notation and forthcoming
analysis to fit our model and results throughout the rest of the paper.

As previously mentioned, the problem can be formulated as a finite MDP. Let us denote the
ground MDP by $M_0$, characterized by the five-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ is the finite state
space, in which we term every state $s \in \mathcal{S}$ a detailed state, since it includes a detailed account
of the system; $\mathcal{A}$ is a finite set of actions called the action space; $\mathcal{P}$ are transition probabilities with
$p(s'|s, a)$ denoting the probability to proceed to state $s'$, being in state s and acting with action a;
$\mathcal{R}$ is a bounded reward function with $r(s', a, s)$ denoting the immediate reward gained
by taking action a in state s and proceeding to state $s'$. We consider both long run average
cost and discounted cost, with $0 < \gamma < 1$ being a discount factor. A policy is a mapping from
states to actions ($\mathcal{S} \mapsto \mathcal{A}$). In this paper we focus only on policies that do not depend on
the time (stationary policies). We denote the set of all admissible policies by $\mathcal{U}$. Recall that
$p(s'|s, a)$ is the probability to proceed to state $s'$, being in state s and acting with action a, and
that $r(s', a, s)$ is the stochastic reward attained in such an instance. The action in some state s
is denoted by $a(s)$. We further denote by $r(s, a) = \sum_{s'} r(s', a, s) p(s'|s, a)$ the average reward
of being in state s and taking action a. As previously mentioned, we consider two performance
criteria: discounted infinite horizon cost and long run average cost. Specifically, the discounted
infinite horizon cost associated with a given policy $\pi$ and initial state $s_0$ is given by

$$J^{\pi}(s_0) = E\left[\sum_{t=0}^{\infty} \gamma^t r(s_{t+1}, a_t^{\pi}, s_t) \,\Big|\, s_0\right]$$


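where $s_t$ and $a_t^{\pi}$ denote the state visited at time slot t and the action taken at time slot t based on
state $s_t$ and according to policy $\pi$. The long run average cost associated with policy $\pi$ is

$$J^{\pi} = \lim_{N \to \infty} \frac{1}{N} E\left[\sum_{t=0}^{N-1} r(s_{t+1}, a_t^{\pi}, s_t)\right]$$    (2)
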


Note that the initial state has no impact on the long run average cost (Eq. (2)), as its effect
is dissolved over time ([?]). In this section, we only refer to the discounted case. We examine
the average case in Section V and in the appendices. The value function for the discounted case
is given by $V(s_0) = \sup_{\pi \in \mathcal{U}} J^{\pi}(s_0)$.
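
The two performance criteria can be made concrete on a finite sample path. The sketch below computes the discounted sum of a recorded reward sequence and its empirical per-slot average; both are finite-horizon estimates of the infinite-horizon quantities defined above, and the function names are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence r_0, r_1, ..., r_{n-1}."""
    total = 0.0
    # Horner-style evaluation from the last reward backwards.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def average_reward(rewards):
    """Empirical per-slot average reward, i.e. the measured throughput."""
    return sum(rewards) / len(rewards)
```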

We now define the restricted and induced MDPs, which allow us to work with much simpler
MDPs in our communication problem, yet retain the notion of network coding and hence the
near-optimal performance.

The policy restricted MDP is motivated by the state aggregation we suggest. State aggregation
exploits properties present in the state space of the basic MDP (the detailed states) for aggregation
of multiple detailed states into one aggregated state, obtaining an MDP with a smaller state space.
In particular, a partition $\hat{\mathcal{S}} = \{\hat{s}_1, \dots, \hat{s}_n\}$ of the detailed state space may serve as an aggregated
state space if each detailed state is mapped to one and only one aggregated state
($\bigcup_{i=1}^{n} \hat{s}_i = \mathcal{S}$; $\hat{s}_i \cap \hat{s}_j = \emptyset$ for $i \neq j$). We now formally define the policy restricted MDP.

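Definition 3.1. A policy restricted MDP, denoted by $M_1 = \mathcal{P}(M_0, \phi, \hat{\mathcal{A}})$, is defined by

(I) A mapping $\phi$ acting on $\mathcal{S}$, such that $\phi : \mathcal{S} \mapsto \hat{\mathcal{S}}$, where $\mathcal{S} = \bigcup_i \hat{s}_i$ with disjoint $\hat{s}_i$,

(II) A restricted action space $\hat{\mathcal{A}} \subseteq \mathcal{A}$, and
(III) A restricted set of policies $\hat{\mathcal{U}} \subseteq \mathcal{U}$, such that for all $\pi \in \hat{\mathcal{U}}$, it holds that $\pi(s) \in \hat{\mathcal{A}}$ $\forall s \in \mathcal{S}$, and
if $\phi(s_1) = \phi(s_2)$ then $a^{\pi}(s_1) = a^{\pi}(s_2)$, where $a^{\pi}(s_1) = \pi(s_1)$ and $a^{\pi}(s_2) = \pi(s_2)$.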

In other words, we define a mapping rule $\phi(s)$ which associates each detailed state with
an aggregated state, partitioning the state space ($\mathcal{S}$) into the aggregated state space ($\hat{\mathcal{S}}$). In
correspondence to the aggregated state space, only policies that enforce the same action for
all states belonging to the same aggregated state are admissible, i.e., the same action should
be taken for all $s_j \in \hat{s}_i$. We will use the notation $s \in \hat{s}$ if it holds that $\phi(s) = \hat{s}$, and $\pi(\hat{s})$ as the
equivalent of $\pi(\phi(s))$.

Note that the policy restricted MDP is still based on the detailed state-space and thus is 
difficult to calculate. Accordingly, we define the induced MDP to which the detailed states are 
transparent. The induced MDP is formed by the atomic states, induced by the aforementioned 
aggregated states, hence, relies on significantly smaller state space, and has similar action rules. 
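By means of the aggregated state space and the corresponding restricted policy space, one
can define transition probabilities as follows: Given an admissible policy $\pi \in \hat{U}$, the transition
probabilities between the aggregated states, which we denote by $p(\hat{s}'|\hat{s},\hat{a})$, are:

$$p(\hat{s}'|\hat{s},\hat{a}) = \sum_{s' \in \hat{s}'} \sum_{s''} p(s'|s'',\hat{s},\hat{a})\, p^{\pi}(s''|\hat{s},\hat{a}) = \sum_{s' \in \hat{s}'} \sum_{s'' \in \hat{s}} p(s'|s'',\hat{a})\, p^{\pi}(s''|\hat{s}), \qquad (3)$$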

where $p^{\pi}(s''|\hat{s})$ denotes the stationary probability of being in the detailed state $s'' \in S$,
conditioned on the aggregated state $\hat{s}$. Obviously, these probabilities may depend on the policy
$\pi \in \mathcal{U}$, hence the superscript $\pi$; yet, for simplicity in the sequel, when clear from the context, we
will omit the superscript. Clearly, $\sum_{s'' \in \hat{s}} p^{\pi}(s''|\hat{s}) = 1$. Define the cost of the policy restricted
MDP as follows: $J^{\pi}(s_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_{t+1}, a_t^{\pi}, s_t) \,|\, s_0\right]$. The corresponding value function is given
by $V_{\hat{U}}(s_0) = \sup_{\pi \in \hat{U}} J^{\pi}(s_0)$. Since the policy restricted MDP sees the detailed states, we also define
$J^{\pi}(\hat{s}_0) = \sum_{s_0 \in \hat{s}_0} J^{\pi}(s_0)\, p^{\pi}(s_0|\hat{s}_0)$ and $V_{\hat{U}}(\hat{s}_0) = \sup_{\pi \in \hat{U}} J^{\pi}(\hat{s}_0)$.
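
As an illustration of how aggregated transition probabilities are assembled from detailed ones, the following is a minimal sketch, assuming the detailed transition probabilities under a fixed action and the conditional stationary weights of the detailed states are already available as dictionaries. The function name `aggregate_transitions` and the data layout are hypothetical and are not taken from the paper.

```python
def aggregate_transitions(p_detail, phi, weights):
    """Aggregate detailed transition probabilities for one fixed action.

    p_detail[s_from][s_to]: probability of moving between detailed states.
    phi[s]: aggregated state that detailed state s is mapped to.
    weights[s]: probability of being in s given its aggregated state
                (assumed known here; in the paper it is policy dependent).
    Returns a dict keyed by (aggregated_from, aggregated_to).
    """
    p_agg = {}
    for s_from, row in p_detail.items():
        for s_to, prob in row.items():
            key = (phi[s_from], phi[s_to])
            # Weighted sum over detailed origins, plain sum over destinations.
            p_agg[key] = p_agg.get(key, 0.0) + weights[s_from] * prob
    return p_agg


# Hypothetical toy example: detailed states a, b form X; c forms Y.
phi = {"a": "X", "b": "X", "c": "Y"}
weights = {"a": 0.25, "b": 0.75, "c": 1.0}
p_detail = {
    "a": {"a": 0.5, "c": 0.5},
    "b": {"b": 0.25, "c": 0.75},
    "c": {"a": 1.0},
}
p_agg = aggregate_transitions(p_detail, phi, weights)
```

In this toy case the outgoing probabilities of each aggregated state still sum to one, which is what makes the aggregated process usable as an MDP.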

Next we formally define the induced MDP: 

Definition 3.2. MDP $M_2 = \mathcal{I}(M_0, \phi, \hat{A})$ is induced by the policy restricted $M_1$ on $M_0$, if

(I) Each state $\tilde{s} \in \tilde{S}$ uniquely relates to some $\hat{s} \in \hat{S}$; denote this relation as $\tilde{s} \sim \hat{s}$.

(II) For all $\tilde{s} \sim \hat{s}$, the actions $a(\tilde{s})$ available in $\tilde{s}$ are equivalent to $a(\hat{s})$. Denote the relation of
the action spaces as $\tilde{A} \sim \hat{A}$, and the relation of the actions as $\tilde{a} \sim \hat{a}$.

(III) The transition probabilities are defined on a similar probability space and comply with
$p(\tilde{s}'|\tilde{s}, \tilde{a}) = p(\hat{s}'|\hat{s}, \hat{a})$, for all $\hat{s}', \hat{s}, \hat{a}$, for which $\tilde{s} \sim \hat{s}$.

Note that an induced MDP sees no detailed states. That is, each state of the induced MDP
stands for a distinct aggregation of detailed states in a policy restricted MDP. Note that if one
takes a sequence of detailed states $\{s_0, s_1, s_2, \ldots\}$ and applies $\phi$ to it, the resulting sequence
$\{\phi(s_0), \phi(s_1), \phi(s_2), \ldots\}$ is not necessarily Markovian. This is because $\phi$ is a surjective but
non-injective function, and hence not a bijection.
However, as we show in the sequel, one can construct transition probabilities from $\phi(s_t)$ to
$\phi(s_{t+1})$, i.e., between the aggregated states, such that the resulting process is Markovian. As far as the problem
of coded retransmission is concerned, the state space is reduced from S = to S, where 

the size of the latter is determined by the properties of the aforementioned mapping $\phi$. Denote
by $\tilde{U}$ the set of policies defined over $\tilde{A}$.

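The discounted infinite horizon cost associated with some policy $\pi \in \tilde{U}$ is given by $J^{\pi}(\tilde{s}_0) =
\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \tilde{r}(\tilde{s}_{t+1}, \tilde{a}_t, \tilde{s}_t) \,|\, \tilde{s}_0\right]$. The corresponding value function is given by $V_{\tilde{U}}(\tilde{s}_0) = \sup_{\pi \in \tilde{U}} J^{\pi}(\tilde{s}_0)$.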

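We aim to set the appropriate reward function for the induced MDP such that its value function
will be comparable to that of the policy restricted one. The relation between $\mathcal{I}(M_0, \phi, \hat{A})$ and
$\mathcal{P}(M_0, \phi, \hat{A})$ is given by the following proposition:

Proposition 3.1. For an MDP $M_0(S, A, \mathcal{T}, \mathcal{R}, \gamma)$, a policy restricted MDP $M_1(S, \hat{A}, \mathcal{T}, \mathcal{R}, \gamma)$
such that $M_1 = \mathcal{P}(M_0, \phi, \hat{A})$, and an induced MDP $M_2(\tilde{S}, \tilde{A}, \tilde{\mathcal{T}}, \tilde{\mathcal{R}}, \gamma)$, where $M_2 = \mathcal{I}(M_0, \phi, \hat{A})$,
with given initial states $\tilde{s}_0 \sim \hat{s}_0$, there exists a reward function $\tilde{\mathcal{R}}$ such that $V_{\tilde{U}}(\tilde{s}_0) = V_{\hat{U}}(\hat{s}_0)$.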

See Appendix A for the proof.

Intuitively, one sees that the reward of an induced MDP may be interpreted as the suitably 
weighted sum of the rewards of the corresponding policy restricted MDP, normalized by the sum 
of the weights. Note that these weights are found by the transition probabilities to the detailed 
states which compose the corresponding destination aggregated state, s', for which the relation 
$\tilde{s}' \sim \hat{s}'$ holds. The key point is that with the proper reward function, the induced MDP achieves
the same value function as the restricted one. Note that since $\hat{U} \subset \mathcal{U}$, in general, we have
$V_{\tilde{U}}(\tilde{s}_0) = V_{\hat{U}}(\hat{s}_0) \le V(s_0)$.

IV. State aggregation and reinforcement learning based solution

Having laid the ground, in this section we follow the notations and definitions described in
Section III to provide the formal definition of the state aggregation and restricted policy for the
communication problem considered. Specifically, we will base both the aggregated states and 
the action space on the clique size (which will be defined shortly) and on the number of empty 
lines in the state matrix; the rewards and transition probabilities of the induced MDP will be 
determined accordingly. 

A. State aggregation and the restricted action space 

In order to define the state aggregation and the restricted action space, let us first define a
clique structure and associate it with clique transmission. We associate a directed graph $G(V, \Gamma)$
with each state $s \in S$, such that a vertex $v_j \in V$ is assigned to each user $j$ and a set of directed
edges is formed between each user and the users it holds a packet for, i.e., $\Gamma(s) = \{e_{ij} =
(v_i, v_j) \mid s(i,j) = 1\}$. As commonly defined in graph theory, a clique is a subset of vertices such
that each vertex is connected to every other vertex in the set, i.e., $Q$ is a clique iff $s(i,j) = 1$ for all
$v_i, v_j \in Q$ with $j \neq i$, $i, j \in \{1, \ldots, K\}$. The size of a clique is determined by the number
of vertices it contains. Note that in the context of our problem any set of users forming such
a clique implies that each user in the set has all the messages intended for all other
users in the set. Accordingly, a coded message, composed of all the messages intended for all
users in the set, can be sent, such that each user in the set can decode its own. Denote the size
of the maximum clique induced by state $s$ by $L(s)$, and by $E(s)$ the number of empty lines in
$s$.

We construct the aggregation such that each aggregated state is defined by the tuple $(L(s), E(s))$,
i.e., $\phi(s) = (L(s), E(s))$. For clarification let us examine the following example:

Example 4.1. Consider a communication network consisting of 5 users. Observe the following
states:

Note that $s_1$ contains a clique of size 3 associated with users 2, 3, 4 and a clique of size 2
associated with users 1, 2. The state $s_2$ contains two cliques of size 3, associated with users
1, 2, 3 and users 1, 2, 5. There are no empty lines in either state. Since the suggested aggregation
considers only the maximum clique size and the number of empty lines, both states above pertain
to the same aggregated state, denoted by $(3,0)$, i.e., $\phi(s_1) = (L(s_1), E(s_1)) = (3,0)$ and
$\phi(s_2) = (L(s_2), E(s_2)) = (3,0)$, so that $\phi(s_1) = \phi(s_2) = (3,0)$.
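
The mapping $\phi(s) = (L(s), E(s))$ can be sketched as follows. This is a minimal brute-force illustration, assuming that the state is a $K \times K$ 0/1 matrix with `s[i][j] == 1` marking an edge from user `i` to user `j`, and that an empty line is a row with no off-diagonal ones. The two matrices below are hypothetical stand-ins that only mimic the clique structure described in Example 4.1; they are not the matrices of the original example.

```python
from itertools import combinations


def is_clique(s, users):
    """True if every ordered pair of distinct users in the set has an edge."""
    return all(s[i][j] == 1 for i in users for j in users if i != j)


def max_clique_size(s):
    """L(s): size of the largest clique, by exhaustive search (small K only)."""
    k = len(s)
    for size in range(k, 0, -1):
        if any(is_clique(s, q) for q in combinations(range(k), size)):
            return size
    return 0


def empty_lines(s):
    """E(s): number of rows without any off-diagonal entry equal to 1."""
    return sum(
        1 for i, row in enumerate(s)
        if not any(v == 1 for j, v in enumerate(row) if j != i)
    )


def aggregate(s):
    """phi(s) = (L(s), E(s))."""
    return (max_clique_size(s), empty_lines(s))


# Hypothetical state: clique on users {2,3,4} and clique on users {1,2} (1-indexed).
s1 = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
]
# Hypothetical state: cliques on users {1,2,3} and {1,2,5} (1-indexed).
s2 = [
    [0, 1, 1, 0, 1],
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
```

Both hypothetical states map to the same aggregated state, so an admissible policy must prescribe the same action for them. The exhaustive search is exponential in $K$ and is meant only for small illustrations.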

The additional detailed example can be found in Appendix 11.21 Note that the number of 
possible states (i.e., number of unique pairs $(L(s), E(s))$) is dramatically reduced and is upper
bounded by $J = (K+1)K$. Further note that while finding a maximum clique is hard in general,
graphs resulting from the state matrix in our setting are random and have cliques of logarithmic 
size m, hence L{s) can be found efficiently. 

Having defined the state aggregation, we define the restricted action space. In particular, in
accordance with the aggregated states we allow only two actions: sending a coded packet to the
maximum clique, which we denote by $\hat{a} = 1$, or sending an uncoded packet corresponding to
a randomly chosen empty line, denoted by $\hat{a} = 2$ ($\hat{A} = \{1,2\}$). Note that the restricted action
space complies with the constraint that the same policy should be applied to all states in the
same aggregated state. It is important to note that once an action is decided (according to the
aggregated state), the actual combination depends on the detailed state (i.e., to which user (users)
to send an uncoded (coded) packet). In Example 4.1, since there are no empty lines, the only
permissible action is to send a coded packet to the maximum clique, that is, sending $p_2 \oplus p_3 \oplus p_4$
for $s_1$, or one of $p_1 \oplus p_2 \oplus p_3$, $p_1 \oplus p_2 \oplus p_5$ for $s_2$. Note that in the case that there are no empty
lines and the maximum clique size is one, the AP should send a coded packet to one of the
maximum cliques, yet since the size of the maximum clique is equal to 1, the coded packet
comprises a single packet, hence it is practically uncoded.
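
To make the clique transmission concrete, the following minimal sketch shows how a coded packet such as $p_2 \oplus p_3 \oplus p_4$ can be formed and decoded, assuming that $\oplus$ is a bytewise XOR and that all packets have equal length. The function names and the packet contents are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_clique(packets, clique):
    """Coded packet: XOR of the packets intended for all users in the clique."""
    return reduce(xor_bytes, (packets[u] for u in clique))


def decode(coded, stored):
    """A clique member removes the packets it already stores for the others."""
    return reduce(xor_bytes, stored, coded)


# Hypothetical pending packets of users 2, 3 and 4.
packets = {2: b"aaaa", 3: b"bbbb", 4: b"cccc"}
coded = encode_clique(packets, [2, 3, 4])

# User 2 holds the packets intended for users 3 and 4, so it can decode its own.
recovered = decode(coded, [packets[3], packets[4]])
```

A single successful broadcast of the coded packet can thus serve every member of the clique at once, which is the source of the throughput gain.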

Obviously, the action space defined here is not the only plausible option. For example, one may
define sending the empty line which has the greatest potential to increase the maximal clique.
Moreover, in some cases sending an uncoded packet to a non-empty line might be a more
valuable option. However, our approach is to choose a simple aggregation that, even though not
optimal, is clearly motivated by the original communication problem, hence is expected to attain
good results. In addition, we aim to keep the number of operations (e.g., determining the maximal
clique or random selection of an empty line) which the AP is required to perform (on the
detailed states) minimal. The evaluation part (Section VI) confirms that even though our
approach is not optimal, it attains very good results.

B. Finding the policy utilizing reinforcement learning 

In the previous subsection we have defined the state aggregation and the restricted action
space. In order to complete the setup, in this subsection we obtain the appropriate reward and
the transition probabilities $p(\tilde{s}|\tilde{s}', \tilde{a})$ for the induced MDP.

There are three major obstacles in computing the transition probabilities and constructing the
associated rewards according to Proposition 3.1. First, the packet loss probabilities typically are
not known to the AP. Second, in order to compute the transition probabilities one needs to go
over each detailed state and compute the probability of going to each state for each possible
aetion (it implies order of x aetion). Third, the transition probabilities are 

policy dependent, i.e., the transition probability of going from aggregated state $\hat{s}$ to aggregated
state $\hat{s}'$ relies on the steady state probability of being in detailed state $s$ given that the system is
in aggregated state $\hat{s}$ (see equation (3)). Recall that our objective
is to determine the policy. Even though the first obstacle is relatively easy to resolve, as the
AP can keep a history record and, if necessary, send dedicated probe packets to estimate the
packet loss on each outgoing link, the other difficulties are more challenging, as trying
to compute the transition probabilities and the reward values directly is impractical. Accordingly, we
utilize reinforcement learning (RL), an effective learning technique which is capable
of finding the reward maximizing policy in discrete stochastic environments without explicit
specification of the transition probabilities. Specifically, RL is based on a feedback loop in which
the reinforcement agent (learner, or AP in our case) selects an action based on its current state,
gets feedback in the form of the next state and an associated reward, and updates the estimated
records. The selection of the action is based on the current state and the temporary (current)
policy, and balances exploration and exploitation, i.e., on the one hand the agent has to exploit
what is already known, but on the other hand it has to explore in order to examine other options
for making better action selections in the future. Accordingly, the agent must try a variety of
actions and progressively favor those that appear to be best (e.g., [35]). One of the difficulties of
our learning problem is that access frequencies differ greatly among the various
states. Accordingly, since the algorithm is expected to visit each state multiple times, we need to
direct it and force it to visit less visited states. Several RL algorithms can be utilized to
solve our problem, e.g., MBIE [36], [37] and R-Max [38]; each one has its own merits.
Nonetheless, since our main concern is the application itself, rather than adopting one
of the known algorithms, we derived a simple modified algorithm which best suits our problem.

The proposed algorithm iterates between two steps: the learning step and the policy improvement
step. Specifically, we utilize a random policy (e.g., choose at random whether to transmit a
randomly chosen empty line or to transmit to the maximum clique) for the learning. In each
step, we apply the temporary policy which was found in the previous step. We utilize an $\epsilon$-greedy
approach with the temporary policy (that is, choose the action according to the temporary policy
with probability $1 - \epsilon$, and choose a random action otherwise) for $N_k$ consecutive iterations
(transmissions), recording the visited aggregated states and the attained rewards (the number
of consecutive transmissions can vary between steps, hence the subscript $k$). It is important to
note that even though the system traverses the detailed states, only the aggregated states, the
actions taken and the rewards attained are recorded. That is, the AP does not hold any record
of the visited detailed states. Next, we update the temporary policy, utilizing the newly learned
reward functions and transition probabilities obtained during the learning phase, by applying
value iteration on the corresponding Bellman equation, that is,

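$$V(\tilde{s}) = \max\Big\{ \mathbb{E}_{\tilde{s}'}\big[r(\tilde{s}', \tilde{a} = 1, \tilde{s}) + \gamma V(\tilde{s}')\big],\; \mathbb{E}_{\tilde{s}'}\big[r(\tilde{s}', \tilde{a} = 2, \tilde{s}) + \gamma V(\tilde{s}')\big] \Big\}. \qquad (4)$$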

This reinforcement learning procedure continues until sufficient convergence in $V(\tilde{s})$ or until
the policy is unchanged. The outcome of the proposed algorithm is the optimal policy for the
induced MDP and the corresponding nearly-optimal $V(\tilde{s})$.
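
The policy improvement step can be sketched as a plain value iteration over a small tabular model. The following is a minimal illustration, assuming that estimated transition probabilities and rewards are given as dictionaries; the function names, the data layout, and the two-state toy model are hypothetical and only mirror the structure of the Bellman equation (4).

```python
def q_value(s, a, v, p, r, gamma):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(
        prob * (r[(s, a, s_next)] + gamma * v[s_next])
        for s_next, prob in p[(s, a)].items()
    )


def value_iteration(states, actions, p, r, gamma, tol=1e-10, max_iter=100000):
    """p[(s, a)] maps next states to probabilities; r[(s, a, s_next)] is a reward.

    Actions missing from p for a given state are treated as infeasible there.
    """
    v = {s: 0.0 for s in states}
    for _ in range(max_iter):
        new_v = {
            s: max(q_value(s, a, v, p, r, gamma) for a in actions if (s, a) in p)
            for s in states
        }
        delta = max(abs(new_v[s] - v[s]) for s in states)
        v = new_v
        if delta < tol:
            break
    policy = {
        s: max(
            (a for a in actions if (s, a) in p),
            key=lambda a: q_value(s, a, v, p, r, gamma),
        )
        for s in states
    }
    return v, policy


# Hypothetical two-state model with actions 1 and 2.
p = {(0, 1): {0: 1.0}, (0, 2): {1: 1.0}, (1, 1): {0: 1.0}, (1, 2): {1: 1.0}}
r = {(0, 1, 0): 0.0, (0, 2, 1): 1.0, (1, 1, 0): 2.0, (1, 2, 1): 0.0}
values, policy = value_iteration([0, 1], [1, 2], p, r, gamma=0.5)
```

For this toy model the fixed point can be checked by hand: $V(0) = 1 + 0.5\,V(1)$ and $V(1) = 2 + 0.5\,V(0)$ give $V(0) = 8/3$ and $V(1) = 10/3$.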

A pseudo code of the algorithm is given in Algorithm A. The algorithm starts with picking a
random initial policy, denoted by $\pi_R$ (Initialization step in Algorithm A). The random policy $\pi_R$
we implemented chooses between the possible actions with equal probability, namely, $\hat{a} = 1$ or
$\hat{a} = 2$ with probability 1/2 each, when the choice is feasible, where 1 and 2 stand for transmitting
the maximal clique and the random empty line, respectively. After the Initialization step,

Algorithm A

Initialization

1) Initialize policy $\pi_0 = \pi_R$. Set $n(\tilde{s}', \tilde{a}, \tilde{s}) = 0$, $R(\tilde{s}', \tilde{a}, \tilde{s}) = 0$.

2) Set $\pi_T = \pi_R$.

At step $k > 0$

1) Update $\epsilon_k$ from a predefined decreasing sequence $\{\epsilon_k\}$. Set $N_k$.

2) Set policy $\pi_k = \pi_R$ with probability $\epsilon_k$, and $\pi_k = \pi_T$ with probability $1 - \epsilon_k$.

3) Run $M_1$ with $\pi_k$ for $N_k$ transmissions.

a) For each visit to $\tilde{s}$, acting $\tilde{a}$ with reward $r'$ and going to $\tilde{s}'$,
set $n(\tilde{s}', \tilde{a}, \tilde{s}) = n(\tilde{s}', \tilde{a}, \tilde{s}) + 1$, $R(\tilde{s}', \tilde{a}, \tilde{s}) = R(\tilde{s}', \tilde{a}, \tilde{s}) + r'$.

4) Calculate $p(\tilde{s}'|\tilde{a}, \tilde{s})$ and $r(\tilde{s}', \tilde{a}, \tilde{s})$ from $n(\tilde{s}', \tilde{a}, \tilde{s})$, $R(\tilde{s}', \tilde{a}, \tilde{s})$ and $N_k$.

5) Find $V_k$ by value iteration over $M_2$. Retrieve the optimal policy $\pi_T$.

6) If $|J_k - J_{k-1}| < \epsilon$, for some predefined $\epsilon$, then finish. Otherwise perform step $k + 1$.


the algorithm runs between two steps, the learning step and the policy improvement step, which
are repeated iteratively. At each step the algorithm starts with a least visited aggregated state
(the detailed state within can be arbitrary), and traverses the states for $N_k$ consecutive
transmissions, based on the $\epsilon$-greedy policy (line 2). Obviously, only the restricted actions,
i.e., transmitting an empty line or transmitting the maximum clique, are allowed. The parameter
$\epsilon$ is updated at the beginning of each step (line 1). After each action the agent records the
previous and the next aggregated states, the action taken and the reward attained (line 4). After
$N_k$ consecutive transmissions, the policy for the next steps is updated by solving the Bellman
equation. The algorithm terminates when the policy or the attained value converges.
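
The bookkeeping of the learning step can be sketched as follows. This is a minimal illustration, assuming that the transition probabilities are estimated as relative frequencies of the recorded counts and the rewards as sample averages; the source does not spell out the exact estimator, so this choice, as well as the class and function names, is an assumption made for illustration.

```python
import random
from collections import defaultdict


class AggregatedModel:
    """Counters n(s', a, s) and accumulated rewards R(s', a, s)."""

    def __init__(self):
        self.n = defaultdict(int)
        self.R = defaultdict(float)

    def record(self, s, a, reward, s_next):
        """Store one observed transition between aggregated states."""
        self.n[(s_next, a, s)] += 1
        self.R[(s_next, a, s)] += reward

    def estimates(self):
        """Relative-frequency estimate of p(s'|a, s) and average reward r(s', a, s)."""
        visits = defaultdict(int)
        for (s_next, a, s), count in self.n.items():
            visits[(a, s)] += count
        p = {key: count / visits[(key[1], key[2])] for key, count in self.n.items()}
        r = {key: self.R[key] / count for key, count in self.n.items()}
        return p, r


def epsilon_greedy(policy, s, actions, eps, rng):
    """Follow the temporary policy with probability 1 - eps, else act at random."""
    if rng.random() < eps:
        return rng.choice(actions)
    return policy[s]


# Hypothetical recorded transitions from aggregated state (3, 0) under action 1.
model = AggregatedModel()
model.record((3, 0), 1, 3.0, (2, 0))
model.record((3, 0), 1, 3.0, (2, 0))
model.record((3, 0), 1, 1.0, (3, 0))
p_hat, r_hat = model.estimates()
```

Only aggregated states appear in the records, in line with the remark that the AP does not store the visited detailed states.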

Note that the algorithm does not rely on knowing the packet loss probabilities. That is, the
algorithm learns the transition probabilities of the induced MDP at any fixed channel condition,
regardless of the exact packet loss values. Obviously, the algorithm relies on these probabilities
being fixed in time.

For the average cost long run case, the algorithm should be altered by correspondingly 
adjusting the learning step and the update step (see, e.g., (HI). We discuss the implementation 
details and results in Section VI.

C. State aggregation with a TTE constraint 

In this subsection we utilize a similar aggregated MDP formulation to encompass TTE
constraints. Since the TTE constrained and unconstrained models are never considered
simultaneously, with slight abuse of notation, we will denote the states for the constrained case similarly
to the unconstrained one. The meaning will be clear from the context. Since under a TTE
constraint stored packets become obsolete, the suggested state aggregation will incorporate
the age of the "oldest" line. In particular, we propose two state aggregations, both of which
maintain the number of empty lines and the age of the oldest line, where Aggregation I also
preserves the size of the largest clique encompassing this line, while Aggregation II keeps the
size of the largest clique regardless of whether this clique encompasses the oldest line. Next,
we formally describe the two state aggregation functions which map the detailed state to the
corresponding aggregated state; we also design a model-based learning scheme similar to the case with
no TTE constraint.

1) Aggregation I: Define $\phi_I : S \to \{\mathbb{N} \times \mathbb{N} \times \mathbb{N}\}$, $\phi_I(s) = \{F, C, E\}$, where $F(s)$ is
the lowest strictly positive TTE in $s$, $C(s)$ is the size of the maximal clique which contains the
row with $\tau = F$, and $E(s)$ is the number of empty lines in $s$, where $\tau$ was defined in Section II.
Note that $C(s)$ is not necessarily equal to $L(s)$, the size of the maximal clique in $s$. Denote the action space
by $\hat{A}_I = \{1,2\}$, where $\hat{a} = 1$ stands for sending a coded clique $C(s)$, which contains a
line with $\tau = F$, and $\hat{a} = 2$ stands for sending an uncoded packet corresponding to a
randomly chosen empty line from $E(s)$.

Following the formalization presented in Section III, we define the policy restricted MDP
denoted by $M_1^{I} = \mathcal{P}(M_0, \phi_I, \hat{A}_I)$ and the corresponding induced MDP denoted by $M_2^{I} =
\mathcal{I}(M_0, \phi_I, \hat{A}_I)$ (see Definition 3.1 and Definition 3.2, respectively).

The basic approach for finding an approximately optimal policy under Aggregation I is to
harness Algorithm A. The corresponding Bellman equation is written similarly to
(4), where the solution is found by substituting the relevant aggregated states.

2) Aggregation II: Similar to Aggregation I, we define a second mapping $\phi_{II} : S \to
\{\mathbb{N} \times \mathbb{N} \times \mathbb{N}\}$, $\phi_{II}(s) = \{F, L, E\}$, where $E$ denotes the number of empty lines in $s$, $F$ is the
lowest strictly positive TTE in $s$, and $L = L(s)$ denotes the size of the maximal clique in $s$.
Note that here there is no knowledge about the size of the maximal clique containing the line with
$\tau = F$, as there is in Aggregation I. Denote the action space $\hat{A}_{II} = \{1,2,3\}$, where $\hat{a} = 1$ stands for
sending a coded maximal clique $C(s)$, which contains a line with $\tau = F$; $\hat{a} = 2$ stands for
sending an uncoded packet corresponding to a randomly chosen empty line; and $\hat{a} = 3$ stands
for sending $L(s)$, the maximal coded clique in $s$. Note that the action $\hat{a} = 1$ presumes no prior
knowledge about the size of $C(s)$. Thus, the decision in this case is myopic as far as the size
of the clique being sent is concerned. The learning in the case of Aggregation II is performed by
utilizing Algorithm A. We compare both aggregation types by simulations with an alternative
heuristic policy in Section VI.

V. Study of the properties of V 

In this section, we present an in-depth study of the suggested abstract MDP-based approach by 
exploring the properties of the value function found through the reinforcement learning procedure. 
Our primary objective is to understand the structure of the value function. Namely, we aim to 
isolate properties of $V(\tilde{s})$ related to each one of the aggregation parameters. This, in turn, will
allow us to incorporate these properties in the main learning algorithm, resulting in improved 
speed and precision of convergence. Moreover, it will give us better understanding of how each 
of the parameters (e.g., clique size) affects the results, and how the overall coding process 
should behave as a function of these parameters. In particular, in some cases, we will observe 
a threshold type policy in one of the parameters. That is, a policy in which there is at most 
one switching state from one optimal action to the second. Such a property is desirable as once 
the switching point is found, we may set the actions to their optimal values without the need 
to iterate until the ultimate convergence. Furthermore, in most cases, such a threshold policy 
will give a fundamental and rigorous reasoning to very intuitive results, e.g., if sending a coded
clique is beneficial for some $L(s)$, it is definitely beneficial for any $l > L(s)$.

For simplicity, we demonstrate the proof of the existence of a threshold-type policy for the
one-dimensional aggregation defined below.

3) One-dimensional aggregation: As an alternative to the multi-dimensional aggregation
patterns, we introduce an even coarser abstraction. Namely, define $\phi : S \to \{\mathbb{N}\}$, such that
$\phi(s) = L(s)$, that is, the size of the largest clique. Denote a line which is not in the maximal
clique as an e-line. Define a state aggregation by the set $\hat{s} = \{s : L(s) = l\}$, for some given
$l \in \{1, \ldots, K\}$. The action space consists of two actions: $\hat{a} = 1$ stands for sending
the maximal clique, while $\hat{a} = 2$ stands for sending an e-line. While oversimplified, and as
such possibly resulting in inferior performance, this aggregation and the induced MDP serve as
a good example for which we can investigate the value function and gain important insights.
Proposition 5.2 below proves the existence of a threshold policy under an average cost. Let $\pi_a$
be a maximizer over all $\pi$ in (2). That is, $\pi_a = \arg\max_{\pi} \lim_{N \to \infty} \frac{1}{N} \mathbb{E}\left[\sum_{t=0}^{N-1} r(s_{t+1}, a_t^{\pi}, s_t)\right]$.

Proposition 5.2. There exists an optimal policy which is a threshold policy in the size of the
maximal clique. Namely, there exists a constant $k$, $k \in \{2, \ldots, K\}$, such that for $0 < L(s) < k$
and $s \in \hat{s}$ it holds that $\hat{a}(\hat{s}) = 2$, yet for $k \le L(s) \le K$ and $s \in \hat{s}$, we have $\hat{a}(\hat{s}) = 1$.

That is, send the maximal clique (a coded packet) if and only if its size is at least k. Otherwise, 
send an e-line (an uncoded packet). 
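
The rule stated in the proposition can be written down directly. The following minimal sketch uses the action labels of the one-dimensional aggregation (1 for sending the maximal clique, 2 for sending an e-line); the function names are hypothetical, and the second helper simply checks whether a given table of actions has the threshold form.

```python
def threshold_action(clique_size, k):
    """Send the maximal clique (1) iff its size is at least k, else an e-line (2)."""
    return 1 if clique_size >= k else 2


def is_threshold_policy(action_by_size):
    """True if, in increasing clique size, action 2 never follows action 1."""
    seen_clique_action = False
    for size in sorted(action_by_size):
        if action_by_size[size] == 1:
            seen_clique_action = True
        elif seen_clique_action:
            return False
    return True


# Hypothetical example with K = 5 users and threshold k = 3.
example_policy = {size: threshold_action(size, 3) for size in range(1, 6)}
```

A policy of this form is fully described by the single number $k$, which is what makes the threshold property attractive for accelerating the learning.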

We will need the following notation for the proof of Proposition 5.2. We say that a state s is
recurrent under the policy p if when starting at state s and acting according to p, the probability 
to return to s is 1. A state which is not recurrent under p is transient under p. 

Consider a policy $\pi^*$ which is optimal for the average long run cost, that is, $\pi^*$ maximizes the
long run average cost given in (2). Denote a set of states $S_1 \subset \hat{S}$ such that $s^{(i)} \in S_1$ if $a_{\pi^*}(s^{(i)}) = 1$. Denote
a state $s^{(m)}$ such that $s^{(m)} \in S_1$ and $L(s^{(m)}) \le L(s^{(i)})$, $\forall i$, $s^{(i)} \in S_1$. Namely, $S_1$ is the set of
states for which sending a clique is optimal, and $s^{(m)}$ is the state with the minimal maximal
clique in $S_1$, for which it is optimal to send the maximal clique. We have the following claim.

Claim 1. Any state $s^{(j)}$ such that $L(s^{(j)}) > L(s^{(m)})$ is transient under $\pi^*$.

Proof. We use the fact that nodes do not use coded packets in order to decode packets not
intended for them. Namely, nodes store only uncoded packets intended for other users. Hence,
clique transmissions cannot increase the clique size and, moreover, decrease it with some
non-zero probability (note that transmission of an e-line can increase the clique size, yet by at most
1). Consider some $s^{(i)} \in S_1$. By definition $L(s^{(i)}) \ge L(s^{(m)})$. Since $p(s^{(j)} \mid s^{(i)}, 1) > 0$, where
$j < m$, the state $s^{(m)}$ will be reached in a finite number of transmissions. Furthermore, the states
with clique size more than $m$ will not be visited afterwards. That is, once in $s^{(m)}$, the clique size of the future
state cannot be increased. Consequently, for any $s^{(j)}$ such that $L(s^{(j)}) > L(s^{(m)})$, $s^{(j)}$ is transient
under $\pi^*$. □

Note that the claim holds even if $\pi^*$ is not the optimal policy.

Proof. [Proposition 5.2] Consider a policy $\pi^*$, which is optimal for the average long run cost, a
set of states $S_1 \subset \hat{S}$ and $s^{(m)}$ as above. Denote the set $S_R$ such that $s^{(j)} \in S_R$ if $L(s^{(j)}) \le L(s^{(m)})$,
and denote $S_T = \hat{S} \setminus S_R$. Now see that by the claim above, $s^{(m)}$ is the only recurrent state in $S_1$.
Define $n_m$, the first time under $\pi^*$ to be in $s^{(m)}$. We have



Observe that all states encountered at times $n > n_m$ are recurrent. That stems from the fact that
after the transmission at time $n_m$, the process stays in $S_R$. Since $n_m$ is finite a.s., the first sum
(once normalized by $N$) goes to zero. Next, define policy $\pi^{T}$ which acts similarly to $\pi^*$ for all


j such that (that is, all recurrent states) yet sets = 2 otherwise. That 

is, a threshold policy. Denote by $n_1$ the first time to hit $s^{(m)}$ under $\pi^{T}$. Observe that



Thus $\pi^{T}$ is also an optimal policy. Note that the relation between $n_1$ and $n_m$ is not essential,
since both are finite.

It is left to show that the policy which always sends e-lines, that is, sends no cliques at all, is
suboptimal. Denote such a policy as $\pi^{e}$. In such a policy the expected reward at each
step is given by $1-p$, and any other policy which sends a clique at any step outperforms $\pi^{e}$ by
some $\epsilon > 0$. This completes the proof of the proposition. □

The proposition above is intuitive, since the clique size can only be increased by 1. This 
renders all states with the maximal clique larger than the threshold to be, in the long term, 
unreachable. 

Note that Puterman [40] gives general guidelines on how to demonstrate the monotonicity of the
optimal policy, both for the average cost and the discount cost infinite horizon criteria. Here, we 
merely presented the short proof which specifically suits this simple case. 

The connection between average and discounted costs is well known and is described by the
Blackwell optimality condition [34]. In particular, a Blackwell optimal policy is optimal for the
average cost as well. Yet, as seen from the proof of Proposition 5.2, the optimal policy for the
average cost, in this case, is not unique. Hence, the opposite is not necessarily true. Nevertheless,
we address this in the simulations.

The technique demonstrated in the 1-D case can be extrapolated to more complex aggregations.
However, the proofs in these cases will involve treatment of significantly more complex Bellman
equations. Alternatively, one may merely assume the existence of a threshold policy, based
on observations from simulations. The main advantage of having the threshold-type policy
proof or observation is the possibility to enhance Algorithm A, as we explain next. Assume there
exists a threshold policy in $E$, as was presented in Aggregation I. Namely, once there is a switch
from optimal action 2 (transmission of an empty line) to action 1 (transmission
of a clique) at some $E = i$, we deduce that 1 is optimal for all $E < i$, while 2 is optimal for all $E > i$.
Hence, if the existence of a threshold policy in one of the parameters (e.g., $F$, $C$, $E$) is known, then at
step 4 of the algorithm, in case the policy in some (possibly rarely visited) state is not yet clear
at some point of the algorithm run, it can be corrected according to the already known (or conjectured)
threshold rule. This method will accelerate the overall convergence. Another useful property of
$V$, which gives a good understanding of its behavior, is its slope (see Appendix B for both upper
and lower bounds on this slope). Similarly, the bounds can be useful for the manual calibration
of the value function in order to speed up the convergence.
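
The threshold-based correction can be sketched as follows. This is a minimal illustration, assuming a threshold structure in the number of empty lines $E$ in which action 1 is optimal for small $E$ and action 2 for large $E$, as described above; states whose action is not yet clear are marked with `None`. The function name and the example table are hypothetical.

```python
def fill_by_threshold(action_by_e):
    """Fill undecided entries (None) using an assumed threshold rule in E.

    If action 1 is already known to be optimal at some larger E, it is taken
    for the undecided smaller E; if action 2 is known at some smaller E, it is
    taken for the undecided larger E. Entries between the two remain undecided.
    """
    ones = [e for e, a in action_by_e.items() if a == 1]
    twos = [e for e, a in action_by_e.items() if a == 2]
    filled = dict(action_by_e)
    for e, a in action_by_e.items():
        if a is not None:
            continue
        if ones and e < max(ones):
            filled[e] = 1
        elif twos and e > min(twos):
            filled[e] = 2
    return filled


# Hypothetical partially learned policy indexed by the number of empty lines.
partial = {0: 1, 1: None, 2: 1, 3: None, 4: 2, 5: None}
corrected = fill_by_threshold(partial)
```

The entry at $E = 3$ stays undecided because it lies between the last known action 1 and the first known action 2, so the threshold rule alone does not determine it.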

VI. Simulation results 

In this section, we evaluate the suggested transmission strategy through extensive MATLAB 
simulations. Our simulation results provide insight on the impact of each of the mechanisms 
described throughout the paper. Specifically, we thoroughly examine the effect of different 
parameters such as TTE and packet loss probabilities on the value function or on the policy 
structure. In addition we evaluate our algorithm and compare the different aggregations suggested. 

In our simulations we consider a single cell comprising an AP and K receivers. Since our 
results relate to the traffic from the AP to the users, our simulations only consider the downstream 
traffic. We assume that all $K$ users have pending traffic waiting to be transmitted. An i.i.d.
Bernoulli channel error is assumed, where each packet transmission is received or dropped
by each user with probability $1 - p$ and $p$, respectively, independently between different
transmission attempts. The AP works according to Algorithm A with the corresponding aggregation.
In all cases compared, the AP activates the learning routine considering the discounted infinite
horizon cost. Thus, it computes the values attained by the value functions for all possible initial
states. We later use the same policy for calculating the long run average cost. Note that based
on the Blackwell optimality argument (e.g., [34]), if $\gamma \to 1$, under mild conditions the policy
which is optimal for the discounted problem is optimal for the average cost problem as well.
The number of iterations for each phase (learning and improvement) is set in accordance with
the specific configuration.
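
The channel model used in the simulations can be sketched in a few lines. This is a minimal illustration in Python rather than MATLAB, assuming independent Bernoulli losses with the same loss probability for every user; the function name and the parameter values are hypothetical.

```python
import random


def broadcast(num_users, loss_prob, rng):
    """One broadcast attempt: True for each user that receives the packet.

    Each user independently receives the packet with probability 1 - loss_prob.
    """
    return [rng.random() >= loss_prob for _ in range(num_users)]


# Hypothetical check of the empirical reception rate for p = 0.3.
rng = random.Random(1)
trials = 20000
received = sum(broadcast(1, 0.3, rng)[0] for _ in range(trials))
reception_rate = received / trials
```

With differentiated packet loss, the same sketch applies with a per-user loss probability in place of the single value.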

A. Results without a TTE constraint 

We start by evaluating the policy resulting from our learning algorithm for the proposed
aggregation in the case of no TTE constraint (Section IV). We compare our results with the bounds
obtained in [4]. The aggregation for the TTE-unconstrained case constitutes a 2-dimensional state
space, namely, the size of the maximal clique $C$ and the number of empty lines $E$ (Section IV).
The action space comprises two possible actions: transmitting to a user whose packet was not
received by any user (empty line in the state matrix) and transmitting to the maximal group of
users in which each member of the group has a packet destined to every other user in the group
(maximal clique in the state matrix). The performance results (i.e., the percentage of successfully
decoded packets, using the retransmissions) are seen in Eigure [IJtop) along with comparison to 
the bound from [4]. The bound is derived for systems with much stronger coding capabilities,
hence no potential scheme, theoretical or practical, can attain better performance.
Denote it as the Wang upper bound. Note that in order to calculate the bound one needs to solve
120 inequalities, hence the graph has small discrepancies. For larger systems, such calculations
may be too complex. As for the optimal policy, the simulation results show that it is the same
regardless of the packet loss probability. In particular, the optimal policy is defined by transmitting
a random empty line whenever there are empty lines ($E > 0$) and transmitting to the maximal
clique otherwise. Accordingly, the obtained policy is a threshold-based policy. The intuition
behind this strategy is clear: the reward associated with both possible actions, transmitting a
random empty line or transmitting the maximal clique, is time independent, i.e., the expected
reward is the same if the transmission occurs now or in one of the following transmission
opportunities. Moreover, since an empty line is not included in any clique, and in particular not in
the maximal clique, yet transmitting an empty line can potentially increase the size of a clique
without incurring any penalty for delaying the current maximal clique transmission, it is worthwhile
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to fill in the state matrix such that no empty lines are left, and only then to transmit the maximal 
clique. Note that this policy coincides with the one heuristically suggested in IfT^ denoted as the 
semi-greedy algorithm (SG). Accordingly, the simulation results imply that under the restricted 
action space described above, the semi-greedy algorithm IfT^ is optimal, as long as no TTE 
constraints are applied. Moreover, for the simple case of 2-users system, these results achieve 
the sum-capacity which is found according to IfTTlI and Figure [I^down) shows results (value 
functions at all states) for differentiated packet loss. One sees that the case with equal packet 
loss for all users achieves the lowest value function vector. The highest values are obtained for 
the case where two of the five users have relatively low packet loss (0.1), while the other three 
users have relatively high packet loss (more than 0.4). This is explained by that the lossy users 
tend quickly to have a pending packet stored at reliable users. Hence, the lines corresponding 
to these users are most probably not empty while reliable users keep successfully receiving 
uncoded packets. A clique will be sent when some of the reliable users will not receive their 
packet forming a large enough clique for transmission. In overall, the performance is tangibly 
increased, but the throughput improvement comes at expense of hampered fairness. 
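
As a rough illustration of the two aggregated coordinates used in this subsection, the following Python sketch computes the maximal clique size C and the number of empty lines E from a binary state matrix. The matrix convention (entry (i, j) equals 1 when user j stores a packet destined to user i), the function names, and the brute-force clique search are assumptions made for illustration only; they are not taken from the simulation code behind the reported results.

```python
from itertools import combinations


def count_empty_lines(matrix):
    """Number of all-zero rows, i.e. users whose packet nobody has stored."""
    return sum(1 for row in matrix if not any(row))


def max_clique_size(matrix):
    """Size of the largest group of non-empty lines in which every member
    stores the packet of every other member (exhaustive search)."""
    n = len(matrix)
    nonempty = [i for i in range(n) if any(matrix[i])]
    for size in range(len(nonempty), 0, -1):
        for group in combinations(nonempty, size):
            # matrix[i][j] == 1: user j stores the packet destined to user i
            if all(matrix[i][j] == 1 for i in group for j in group if i != j):
                return size
    return 0  # empty matrix: no clique at all


def aggregate_state(matrix):
    """Map a detailed state matrix to the aggregated pair (C, E)."""
    return max_clique_size(matrix), count_empty_lines(matrix)
```

The exhaustive search is exponential in the number of users, which is tolerable only for small systems such as the 5-user case considered here.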

B. Results for TTE constrained aggregations 

Next we evaluate the performance of the suggested transmission strategy under TTE constraints.

We simulated Aggregation I (Section IV), aiming to examine the structure of the value function
for all feasible states. Namely, we try to understand the effect of different parameters on V(s).
Our objective was to identify simple properties such as monotonicity, convexity and threshold-type
structure. Such properties can potentially be utilized to speed up the RL convergence, which
will allow larger systems to be operated successfully. We examined a system with K = 5 receivers
and set γ = 0.99. The results are depicted in Figure 2. The Y-axis depicts the value attained
by each state, V(F; C; E) (denoted by asterisks). Each value corresponds to the given initial
state. The X-axis relates to an enumeration of the states, {1, 2, ...}. Note that the asterisks form
groups of monotone patterns of values. In particular, the states are assigned numbers which
grow first with the TTE (F), next with the maximal clique size (C), and finally with the number
of empty lines (E). For example, state 1 refers to the state in which there are no empty lines,
the maximal clique size is 1 and TTE = 9. State 2 refers to the state in which there are
no empty lines and the maximal clique contains the line with the greatest TTE, which is 8. State 96,
which is the last state, refers to the state in which there are 5 empty lines (i.e., the empty matrix).
Note that for the widespread (e.g., 802.11) policy that only allows uncoded transmissions the
value is fixed, V = 0.75/(1 − 0.99) = 75, which is below the scale of the graph, i.e., the value
for all states is higher than the one for the uncoded ARQ retransmissions.

[Figure 1, panel title: V (value function vs. packet loss differentiation).]
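
The value 75 quoted for the uncoded policy is consistent with a constant expected reward of 1 − p per slot, discounted by γ. The short Python check below assumes p = 0.25 (the packet loss shown in the title of Figure 2) and γ = 0.99; the function names are hypothetical and serve only to illustrate the arithmetic.

```python
def uncoded_discounted_value(p, gamma):
    """Closed form of sum_{n>=0} gamma^n * (1 - p): one uncoded packet per
    slot, delivered with probability 1 - p, regardless of the state."""
    return (1.0 - p) / (1.0 - gamma)


def uncoded_truncated_value(p, gamma, horizon):
    """Same quantity, summed explicitly over a finite horizon."""
    return sum((1.0 - p) * gamma ** n for n in range(horizon))


if __name__ == "__main__":
    # Both numbers should be approximately 75 for p = 0.25, gamma = 0.99.
    print(uncoded_discounted_value(0.25, 0.99))
    print(uncoded_truncated_value(0.25, 0.99, 5000))
```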

We emphasize the structure of the value function when only a single parameter varies while the
other two are held fixed. Specifically, in order to understand the effect of empty lines on the
obtained policy, we emphasize by the dotted (red) line the states in which the TTE and the size of
its corresponding clique are constant, specifically F = 2, C = 2, and the number of empty lines
varies (0 ≤ E ≤ 3). The value decreases as E grows, which can be intuitively explained by the
property that non-empty lines contain some information that can potentially be exploited in future
transmissions, while the empty lines contain no information whatsoever. In addition, in order to
demonstrate the dependence of the value function on the clique size, we emphasize the states in
which the TTE is fixed and equals 2 (F = 2), the number of empty lines is fixed (we show two
different values), and the clique size varies. Observe V(2; C; 0) and V(2; C; 1), which are
represented by the solid cyan and the solid magenta lines, for E = 0 and E = 1, respectively. As
expected, both lines have an increasing pattern with C, i.e., the greater the maximal clique which
corresponds to the line with the lowest TTE, the greater the value function. By observation, one can
also conjecture that the value function has a convex increasing form in C (cyan and magenta lines)
and a convex decreasing form in E (the red line).

The effect of the differentiated packet loss is demonstrated in Figure 2 (down). We compared
four different packet loss distributions, with average value equal to 0.3. Similarly to the case with
no TTE constraint, the best throughput is achieved where the packet loss has the highest variance.
However, the difference is significantly less visible, which is a direct consequence of the TTE
constraint: the TTE limits the number of packets sent by the AP before it sends a
clique incorporating the lossy users' pending packets. Note that, by the same reasoning, the
fairness issue is also less acute. For example, in the case where the most reliable user had packet loss
equal to 0.12 while the most lossy one had packet loss equal to 0.48, the ratio of the number
of packets sent by the AP was 7 : 4 in favor of the reliable user.

We next explore the dependence of the policy found for Aggregation I on various parameters, at
equal packet loss which ranged from 5% to 35%. The results are shown in Figure 4. For reference
convenience, the first column denotes the state enumeration. Recall that 1 stands for sending the
maximal clique containing the oldest line, while 2 stands for transmitting a random empty line.

These results clearly demonstrate that the algorithm converges to the optimal policy in accordance
with the channel condition. As for the threshold-type policy, the proof of this property is hard to
accomplish, as it relies on the transition probabilities, which are hard to attain. However, the
threshold-type property can be observed by simulations, as is seen from the table (see states
(20-22), (27-29)). Note that the property can highly accelerate the RL procedure. As explained in
Section IV, the transition probabilities are approximated by RL. Hence, simulation-based exploration
is necessary in order to identify structural properties. Alternatively, one can attempt to prove the
threshold property for the average long run case, as we proved for the 1-D case in Section V.
Note that as long as all three dimensions of V(s) are viewed, the thresholds are expected to form
three-dimensional surfaces.

Fig. 2. Aggregation I. Each group of asterisks represents the number of empty lines. The group
with E = 0, V(F, C, 0), is near 10, V(F, C, 1) is near 50, V(F, C, 2) is near 60, V(F, C, 3) is
near 80, V(F, C, 4) is near 90 and the lowest isolated state stands for the empty matrix (top).
Effect of differentiated packet loss

[Figure 2 panel titles: Value function for 5 users, with TTE=9, p=0.25 (top); V (value function vs. packet loss differentiation) (down).]

We conclude the observations above by proposing an effective speedup for Algorithm A. The
proposed enhancement stems from the simulation results and from the previously discussed properties
of the value function in Section V. First, in order to successfully operate a larger system, one can
solve a (trial) system with a small number of users with the same aggregation and the same channel
conditions. Next, the resulting optimal policy can be extrapolated in order to get the policy for
the desired system, for example, using threshold and monotonicity patterns, as we examined above.
In particular, define an approximating policy π using an assessment based on the policy found
from a smaller system and the observed properties. Heuristically, this policy should allow a
randomization around conjectured threshold states. Next, an adjustment of the value estimate and of
the policy is heuristically performed. Again, this improvement can be done using the estimated
properties of the value function, or can be combined within the regular run of the reinforcement
learning as it appears in Algorithm A. See also the monotone policy iteration algorithm in [40].

In order to evaluate the effect of TTE on the policy, we compare both Aggregation I and 
Aggregation II with the greedy and semi-greedy algorithms proposed in [16]. Specifically, the 
greedy algorithm aims at maximizing the instantaneous reward received for each transmission 
opportunity. Hence, the policy according to the greedy algorithm is to transmit the maximal 
clique for each transmission opportunity. Whenever there is no clique (i.e., C < 1), it transmits a
random empty line. The semi-greedy (SG) policy is defined in the subsection above. Figure 3
(left and middle) compares the value function of the discounted infinite horizon cost with a zero
matrix as the initial state for the various policies.

Figure 3 (left) clearly shows that, as expected, under the TTE constraints the semi-greedy
algorithm performs almost as poorly as the uncoded policy. This is explained by the fact that it
does not take into account lines which can be discarded, hence it misses clique transmission
opportunities just for trying to fill the matrix with non-empty lines. Moreover, in a system where
the number of users is greater than the TTE, the AP will never be able to fill the state matrix with
non-empty lines, and the aforementioned semi-greedy algorithm coincides with the uncoded algorithm
which sends only uncoded packets. Hence, we devised an alternative heuristic algorithm, termed
modified semi-greedy (MSG). MSG differs from SG in that whenever there is a line in which the TTE is
going to expire on the next slot (i.e., TTE = 1) the AP transmits the maximal clique containing
the oldest line. The results of the MSG heuristic are also depicted in Figure 3. Note that MSG
is indifferent to the channel conditions and acts identically for any packet loss (Figure 3, left).
Further note that both policies rely on the same parameters to make a decision, i.e.,
both act based on the triplet {oldest line, maximal clique size, number of empty lines}.
Nevertheless, Aggregation II outperforms the MSG algorithm at all packet loss values. This can be
explained by the fact that MSG, while being effective as a simple heuristic algorithm, neglects the
channel condition, i.e., MSG provides only a single retransmission opportunity for a packet before
it becomes obsolete, regardless of the loss probability. This is opposed to Aggregation II, which
effectively adjusts the policy to the channel packet loss with no prior knowledge of the packet
loss (p), based on the on-line learning. Indeed, the advantage of Aggregation II becomes more
prominent at higher packet loss values, as can be seen in Figure 3.
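
The three heuristic baselines discussed above can be written as simple decision rules on the aggregated triplet. The Python sketch below is only an illustration under stated assumptions: the argument names F (remaining TTE of the oldest line, or None for the empty matrix), C (maximal clique size) and E (number of empty lines), as well as the function names, are hypothetical, and the action codes follow the convention of the policy table (1 for the clique, 2 for an empty line).

```python
SEND_CLIQUE = 1      # send the maximal clique containing the oldest line
SEND_EMPTY_LINE = 2  # send a random empty line (an uncoded packet)


def greedy(F, C, E):
    """Maximize the instantaneous reward: send a clique whenever one exists."""
    return SEND_CLIQUE if C >= 1 else SEND_EMPTY_LINE


def semi_greedy(F, C, E):
    """SG: fill the empty lines first, send the clique only when E == 0."""
    return SEND_EMPTY_LINE if E > 0 else SEND_CLIQUE


def modified_semi_greedy(F, C, E):
    """MSG: as SG, except that a line about to expire (F == 1) forces
    the transmission of the clique containing it."""
    if C >= 1 and F == 1:
        return SEND_CLIQUE
    return semi_greedy(F, C, E)
```

None of these rules takes the packet loss as an input, which is exactly the limitation of MSG noted above; the learned policy, in contrast, may map the same triplet to different actions for different loss rates.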


[Figure 3: three panels (left, middle, right); vertical axis V(0). Legend entries recoverable from the source: MSG; Aggregation I, TTE=5; Aggregation II, TTE=5; II, TTE=7; Greedy, TTE=5.]

Fig. 3. Value function comparison. The left and the middle figures show the discounted case. The right figure shows the average cost long run.


Next, observe that when the number of users is greater than the TTE, the effect of the surplus
of users is negligible. This stems from the fact that at most TTE lines can
have non-zero entries at any time. Indeed, we see that K = 10 leads to almost no improvement
in performance compared to the TTE = 5 case (the corresponding lines in the middle graph
almost coincide). Hence, we conjecture that for the case where K > TTE, further state-space
minimization could be done. However, once one increases the TTE parameter the performance
improvement is tangible. These results are seen on the middle graph as well. Finally, we compare
the average cost long run simulation results (Figure 3, right). Relying on Blackwell optimality,
we used the same policies we found for the discounted case. One sees the same performance
gradation as for the discounted cost.
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Fig. 4. Approximately optimal policy, for a system with K = 5 users and TTE = 5. 1 stands for sending the clique containing the oldest line, 
while 2 stands for sending a random empty line. Observe the dependence of the policy on the packet loss, e.g. in states 7, 9, 11, 21, 28, 29 (these
states are marked in red). The impact of the parameter F can be seen from states {F, 3, 2} (states 20, 21, 22), for example. Note that the clique is
always sent in the cases where F = 1, i.e., the oldest line in this clique is about to expire. In the cases where F > 1, the policy depends on the
packet loss, and generally tends to change to 2 once p is greater and/or F is higher.


Appendix 

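A. Proof of Proposition 3.1

Proof. We prove by constructing a reward function $\bar{\mathcal{R}} = \{\bar{r}(\bar{s}', \bar{a}, \bar{s})\}$. Let the rewards associated with policy restriction and aggregated originating state be $\tilde{r}(s', \bar{a}, \bar{s})$. Observe that
$\sum_{s'' \in \bar{s}} p\big(\tilde{r}(s', \bar{a}, \bar{s}) = r(s', a, s'')\big) = 1$. Hence,

\[ E\,\tilde{r}(s', \bar{a}, \bar{s}) = \sum_{s'' \in \bar{s}} r(s', a, s'')\, p\big(\tilde{r}(s', \bar{a}, \bar{s}) = r(s', a, s'')\big) = \sum_{s'' \in \bar{s}} r(s', a, s'')\, p^{\pi}(s''|\bar{s}). \tag{5} \]

Partitioning all states in $S$ to the aggregated states, we have:

\[ \tilde{r}(\bar{s}, \bar{a}) = \sum_{s'} E\,\tilde{r}(s', \bar{a}, \bar{s})\, p(s'|\bar{s}, \bar{a}) = \sum_{\bar{s}'} \sum_{s' \in \bar{s}'} E\,\tilde{r}(s', \bar{a}, \bar{s})\, p(s'|\bar{s}, \bar{a}), \tag{6} \]

\[ \tilde{r}(\bar{s}, \bar{a}) = \sum_{s'} \Big( \sum_{s'' \in \bar{s}} r(s', a, s'')\, p^{\pi}(s''|\bar{s}) \Big) p(s'|\bar{s}, \bar{a}) = \sum_{\bar{s}'} \sum_{s' \in \bar{s}'} \Big( \sum_{s'' \in \bar{s}} r(s', a, s'')\, p^{\pi}(s''|\bar{s}) \Big) p(s'|\bar{s}, \bar{a}). \tag{7} \]

Similarly to $\tilde{r}(\bar{s}, \bar{a})$ in the policy restricted MDP, define $\bar{r}(\bar{s}, \bar{a})$ in the induced MDP:

\[ \bar{r}(\bar{s}, \bar{a}) = \sum_{\bar{s}'} \bar{r}(\bar{s}', \bar{a}, \bar{s})\, p(\bar{s}'|\bar{s}, \bar{a}). \tag{8} \]

Thus, we wish to find $\bar{r}(\bar{s}', \bar{a}, \bar{s})$ such that

\[ \tilde{r}(\bar{s}, \bar{a}) = \bar{r}(\bar{s}, \bar{a}). \tag{9} \]

Since both the summation in (8) and the outer summation in (7) are over all aggregated states, (9)
will be achieved by taking:

\[ \bar{r}(\bar{s}', \bar{a}, \bar{s})\, p(\bar{s}'|\bar{s}, \bar{a}) = \sum_{s' \in \bar{s}'} E\,\tilde{r}(s', \bar{a}, \bar{s})\, p(s'|\bar{s}, \bar{a}). \]

That is,

\[ \bar{r}(\bar{s}', \bar{a}, \bar{s}) = \frac{\sum_{s' \in \bar{s}'} \big(E\,\tilde{r}(s', \bar{a}, \bar{s})\big)\, p(s'|\bar{s}, \bar{a})}{p(\bar{s}'|\bar{s}, \bar{a})}, \tag{10} \]

with the mapping $s \mapsto \bar{s}$ and $a \mapsto \bar{a}$. Note that one should use (5) in (10). Hence, we have the
desired result:

\[ \bar{V}(\bar{s}_0) = E \sum_{n=0}^{\infty} \gamma^n\, \bar{r}(\bar{s}_{n+1}, \bar{a}, \bar{s}_n) = E \sum_{n=0}^{\infty} \gamma^n\, \tilde{r}(\bar{s}_n, \bar{a}) = \tilde{V}(\bar{s}_0). \]

□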
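
The construction in (10) amounts to a probability-weighted average of the expected detailed rewards inside each aggregated next state. The following Python sketch illustrates this on toy inputs; the dictionary-based representation, the function names and the numbers used when exercising it are assumptions made for illustration and do not come from the paper.

```python
def induced_reward(expected_reward, p_next, block):
    """Reward of the induced MDP for landing in one aggregated next state.

    expected_reward[s]: expected reward E r(s', a, s_bar) of detailed next state s
    p_next[s]:          probability p(s' | s_bar, a) of detailed next state s
    block:              detailed next states forming the aggregated state
    """
    mass = sum(p_next[s] for s in block)  # p(s_bar' | s_bar, a)
    if mass == 0.0:
        return 0.0  # unreachable aggregated state: the value is immaterial
    return sum(expected_reward[s] * p_next[s] for s in block) / mass


def detailed_one_step_reward(expected_reward, p_next):
    """Expected one-step reward computed over the detailed next states."""
    return sum(expected_reward[s] * p_next[s] for s in p_next)


def induced_one_step_reward(expected_reward, p_next, partition):
    """Expected one-step reward computed over the aggregated next states."""
    total = 0.0
    for block in partition:
        mass = sum(p_next[s] for s in block)
        total += induced_reward(expected_reward, p_next, block) * mass
    return total
```

By construction the two one-step rewards coincide for any partition of the next states, which is the equality the proof requires before summing over time.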


Example 1.2. The following demonstrates state aggregation (as it was defined by Aggregation
I in Section IV) and the results of Proposition 3.1. Consider the case of 4 users. Line $i$ holds
the packets of user $i$. We exemplify the detailed states where $L = 3$, $E = 1$. These states are
aggregated into the state denoted by $\bar{s}_{3,1}$. Possible cliques are demonstrated in the detailed states
denoted $s_1, s_2, s_3, s_4$ below. Observe that these states contain only the minimal number of 1-s.

See that in $s_1$, there are 8 additional options for the last column. In particular, observe the
following four states with the same empty line and the same clique as in $s_1$.

The same holds for $s_2$, $s_3$ and $s_4$. Concluding, the state $\bar{s}_{3,1}$ aggregates 32 detailed states.

There are two possible actions, denoted $\bar{a} = 1$ and $\bar{a} = 2$, which stand respectively for
transmitting the clique and transmitting (the only) empty line. Note that the encoded message
for $s_1$ contains the packets 1, 2, 3, for $s_2$ it contains packets 2, 3, 4, for $s_3$ it contains packets 1, 3, 4
and for $s_4$ it contains packets 1, 2, 4. The probabilities $p(s_j|\bar{s}_{3,1})$ stand for the probability to be
in a specific detailed state which belongs to the aggregated state $\bar{s}_{3,1}$ (we omit the superscript
of the policy in this example). The rest of the example concentrates on the state $s_5 \in \bar{s}_{3,1}$ and
action $\bar{a} = 1$, i.e., transmission of the clique. Assume the action results in the detailed state $s_a$.



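\[
s_a = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},
\qquad
s_9 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}
\]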


Clearly, $s_a \in \bar{s}_{1,3}$. Further, assume equal packet loss probability denoted by $q$. The
aforementioned transition occurs with probability $p(s_a|\bar{a} = 1, s_5) = q(1-q)^2$. That is, two of the
users in the clique (1 and 3) successfully decoded the encoded packet, while user 2 failed to do so.
See that the same transition can happen from state $s_9$. That is, the clique encoding
packets 2, 3, 4 was transmitted, and user 2 failed to decode. This transition occurs with probability
$p(s_a|\bar{a} = 1, s_9) = q(1-q)^2$ as well. We sum up over all such detailed states (according to
Appendix A):

\[ p(s_a|\bar{a} = 1, \bar{s} = \bar{s}_{3,1}) = \sum_{s_i \in \bar{s}_{3,1}} p(s_a|a = 1, s_i)\, p(s_i|\bar{s}_{3,1}). \]

This summation counts over all 32 detailed states in $\bar{s}_{3,1}$. Clearly, some of the probabilities, e.g.,
$p(s_a|a = 1, s_1)$, are zero, hence do not contribute to the summation. For calculation convenience,
we adopt the convention that in these cases $r(s_a, a = 1, s_i) = 0$. We calculate the average reward
associated with the transition from $\bar{s}_{3,1}$ to $s_a$, according to (5):

\[ E\,\tilde{r}(s_a, \bar{a} = 1, \bar{s}_{3,1}) = \sum_{s_i \in \bar{s}_{3,1}} r(s_a, a = 1, s_i)\, p(s_i|\bar{s}_{3,1}). \]

Note that the transition to state $s_a$, acting $\bar{a} = 1$ from $\bar{s}_{3,1}$, is only possible when 2 of 3 encoded
packets were successfully decoded. Thus, the reward for these cases is equal to 2, while for the
other cases it is zero. Let the subset $\bar{S}' \subset \bar{S}$ contain the possible next (aggregated) states,
assuming the clique size in the previous state was 3. Namely, $\bar{S}' = \{\bar{s}_{3,1}, \bar{s}_{2,2}, \bar{s}_{1,3}, \bar{s}_{0,4}\}$, where the
components refer to the events of successfully decoding 0, 1, 2 and 3 packets, correspondingly.
In order to calculate $\tilde{r}(\bar{s}_{3,1}, \bar{a})$, we first sum over all possible outcomes: $\tilde{r}(\bar{s}_{3,1}, \bar{a} = 1) =
\sum_{s_i} E\,\tilde{r}(s_i, \bar{a} = 1, \bar{s}_{3,1})\, p(s_i|\bar{a} = 1, \bar{s}_{3,1})$. Substituting the expected values and the probabilities we found
above, and arranging according to the aggregated states, we have:


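\[
\begin{aligned}
\tilde{r}(\bar{s}_{3,1}, \bar{a} = 1)
&= \sum_{s_i \in \bar{s}_{3,1}} E\,\tilde{r}(s_i, 1, \bar{s}_{3,1})\, p(s_i|1, \bar{s}_{3,1})
 + \sum_{s_i \in \bar{s}_{2,2}} E\,\tilde{r}(s_i, 1, \bar{s}_{3,1})\, p(s_i|1, \bar{s}_{3,1}) \\
&\quad + \sum_{s_i \in \bar{s}_{1,3}} E\,\tilde{r}(s_i, 1, \bar{s}_{3,1})\, p(s_i|1, \bar{s}_{3,1})
 + \sum_{s_i \in \bar{s}_{0,4}} E\,\tilde{r}(s_i, 1, \bar{s}_{3,1})\, p(s_i|1, \bar{s}_{3,1}) \\
&= \sum_{\bar{s}_j \in \bar{S}'} \sum_{s_i \in \bar{s}_j} E\,\tilde{r}(s_i, 1, \bar{s}_{3,1})\, p(s_i|1, \bar{s}_{3,1})
 = \sum_{\bar{s}_j \in \bar{S}'} \sum_{s_i \in \bar{s}_j} \sum_{s_m \in \bar{s}_{3,1}} r(s_i, 1, s_m)\, p(s_m|\bar{s}_{3,1})\, p(s_i|1, \bar{s}_{3,1}).
\end{aligned}
\]

We now turn to the induced MDP $\bar{M}$. Denote $\bar{s} = \bar{s}_{3,1}$ and $\bar{a} = 1$. We find the reward associated
with the transition to $\bar{s}_{1,3}$, $\bar{r}(\bar{s}_{1,3}, \bar{a} = 1, \bar{s}_{3,1})$. Equate component-wise $\bar{r}(\bar{s}, \bar{a})$ and $\tilde{r}(\bar{s}_{3,1}, \bar{a} = 1)$ as
follows:

\[ \bar{r}(\bar{s}_{1,3}, \bar{a} = 1, \bar{s}_{3,1})\, p(\bar{s}_{1,3}|\bar{s}_{3,1}, \bar{a} = 1) = \sum_{s_i \in \bar{s}_{1,3}} E\,\tilde{r}(s_i, \bar{a} = 1, \bar{s}_{3,1})\, p(s_i|\bar{a} = 1, \bar{s}_{3,1}). \]

It is left to calculate the probability $p(\bar{s}_{1,3}|\bar{s}_{3,1}, \bar{a} = 1)$:

\[ p(\bar{s}_{1,3}|\bar{s}_{3,1}, \bar{a} = 1) = \sum_{s' \in \bar{s}_{1,3}} \sum_{s \in \bar{s}_{3,1}} p(s'|a = 1, s)\, p(s|\bar{s}_{3,1}). \]

Finally, the solutions for all possible $\bar{r}(\bar{s}', \bar{a} = 1, \bar{s}_{3,1})$ are found from

\[
\bar{r}(\bar{s}', \bar{a} = 1, \bar{s}_{3,1}) =
\frac{\sum_{s_i \in \bar{s}'} E\,\tilde{r}(s_i, \bar{a} = 1, \bar{s}_{3,1})\, p(s_i|1, \bar{s}_{3,1})}
     {\sum_{s' \in \bar{s}'} \sum_{s \in \bar{s}_{3,1}} p(s'|a = 1, s)\, p(s|\bar{s}_{3,1})},
\qquad \bar{s}' \in \{\bar{s}_{1,3}, \bar{s}_{0,4}, \bar{s}_{2,2}, \bar{s}_{3,1}\}.
\]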


Note that $p(s|\bar{s}_{3,1})$ are policy dependent and, in order to be found, the Markov chain associated
with the MDP should be entirely solved. As it is explained throughout the paper, we circumvent
this difficulty by reinforcement learning. This finishes the example.
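
The transition probabilities used in the example follow a simple pattern when every clique member loses the coded packet independently with the same probability q. The Python sketch below illustrates this; independence across users and the function names are assumptions made for the illustration.

```python
from math import comb


def pattern_probability(num_success, num_fail, q):
    """Probability of one specific outcome pattern: the named users decode
    (each with probability 1 - q) and the remaining ones fail (each with q)."""
    return (1.0 - q) ** num_success * q ** num_fail


def decoded_count_distribution(clique_size, q):
    """Distribution of the number of decoded packets after one coded
    transmission to a clique: entry m is P(exactly m users decode)."""
    return [comb(clique_size, m) * pattern_probability(m, clique_size - m, q)
            for m in range(clique_size + 1)]
```

For a clique of size 3, the pattern in which two specific users decode and the third fails has probability q(1 − q)^2, as in the example, and the event that any two users decode is three times as likely.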


B. Proof of Bounds 

We prove lower and upper bounds on the slope of $V(s)$, the discounted infinite horizon cost.
Denoting by $p^E_k$ the probability to increase $L(s)$ from $k$ to $k+1$ when transmitting an empty
line, see that $p^E_k \le p$, that is, incrementing the clique is conditioned on the transmission being
unsuccessful. Denote by $p_{k,i}$, $0 \le i \le k$, the transition probability from state $k$ to state $i$,
when acting by the transmission of the clique (i.e., $a = 1$). Note that $p_{k,i}$ is formally given by
$p_{k,i} = P(s' = i \,|\, s = k, a = 1)$. Define the operator $T$, corresponding to the Bellman equation, acting
on $V$:

\[ TV(k) = \max\Big\{ \big[ p^E_k \gamma V(k+1) + (1 - p^E_k)\gamma V(k) + (1-p) \big],\ \big[ \gamma \sum_{i=0}^{k} p_{k,i} V(i) + (1-p)k \big] \Big\}, \tag{11} \]

with boundary conditions

\[ TV(0) = p^E_0 \gamma V(1) + (1 - p^E_0)\gamma V(0) + (1-p), \qquad TV(K) = \gamma \sum_{i=0}^{K} p_{K,i} V(i) + (1-p)K. \]

The immediate rewards are explained as follows. The reward for transmission of an empty line
is given by the probability of a successful transmission, that is $1-p$. In the case a clique of size
$k$ is transmitted, we have $k$ potential i.i.d. rewards, which gives $(1-p)k$. To simplify the notation,
denote $S(k) = \gamma \sum_{i=0}^{k} p_{k,i} V(i) + (1-p)k$ and $E(k) = p^E_k \gamma V(k+1) + (1 - p^E_k)\gamma V(k) + (1-p)$.
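
To make the operator in (11) concrete, the following Python sketch iterates it to a fixed point on a small instance. Two modelling choices here are assumptions and not results of the paper: the clique transition probabilities are taken to be purely binomial (no additional disjoint cliques), and the probability of growing the clique by an empty-line transmission is a hypothetical non-increasing function bounded by p. All function names are illustrative.

```python
from math import comb


def clique_transition(k, i, p):
    """Assumed binomial law: i of the k clique members fail to decode."""
    return comb(k, i) * p ** i * (1.0 - p) ** (k - i)


def empty_line_growth(k, K, p):
    """Hypothetical probability that an empty-line transmission grows the
    clique from k to k + 1; non-increasing in k and never above p."""
    return p * (1.0 - k / K)


def bellman_T(V, p, gamma):
    """One application of the operator T, with the boundary conditions:
    only an empty line can be sent at k = 0, only the clique at k = K."""
    K = len(V) - 1
    TV = []
    for k in range(K + 1):
        send_clique = gamma * sum(clique_transition(k, i, p) * V[i]
                                  for i in range(k + 1)) + (1.0 - p) * k
        if k == K:
            TV.append(send_clique)
            continue
        pe = empty_line_growth(k, K, p)
        send_empty = pe * gamma * V[k + 1] + (1.0 - pe) * gamma * V[k] + (1.0 - p)
        TV.append(send_empty if k == 0 else max(send_empty, send_clique))
    return TV


def value_iteration(K, p, gamma, tol=1e-9, max_iter=100000):
    """Iterate T from the zero function until successive iterates agree."""
    V = [0.0] * (K + 1)
    for _ in range(max_iter):
        TV = bellman_T(V, p, gamma)
        if max(abs(a - b) for a, b in zip(TV, V)) < tol:
            return TV
        V = TV
    return V
```

Because every branch of the operator is a discounted average of the input plus a constant, one application shrinks the sup-norm distance between two value functions by at least the factor γ, which is why the plain iteration above converges.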

Let $\mathcal{S}$ be the set of functions from $\{0, 1, \ldots, K\}$ to $\mathbb{R}$ that are nondecreasing, and have slope
bounded from above by $d$, that is

\[ V(k+1) - V(k) \le d, \qquad k \in \{0, 1, \ldots, K-1\}, \tag{12} \]

and bounded from below as follows:

\[ V(k) - V(k-i) \ge i - c, \qquad \text{where } i \in \{1, \ldots, K-1\},\ k \in \{i, i+1, \ldots, K\}. \tag{13} \]

Lemma 1.1 below asserts that $T$ preserves $\mathcal{S}$, and acts on it as a strict contraction. The
combination of these two assertions implies that $V(s)$ is in $\mathcal{S}$ (see the discussion below), that is,
it possesses the corresponding properties (12) and (13).

Lemma 1.1. There exist constants $c$ and $d$, such that one has $T\mathcal{S} \subseteq \mathcal{S}$. Moreover, there exists
a constant $\alpha \in (0, 1)$ such that

\[ \|TU - TW\| \le \alpha \|U - W\| \quad \text{for every } U, W \in \mathcal{S}. \]

Discussion. The main difficulty of the proof below stems from the ambiguity regarding the
transition probabilities. That is, the precise calculation of these probabilities is computationally
infeasible, especially for a large number of users, $K$. We solved this by reinforcement learning on
the practical side. On the analytical side, we make several assumptions and estimations, which
we justify throughout the proof. To this end, the proof is primarily built on the assumption that
$V \in \mathcal{S}$ and possesses all the corresponding properties. We exploit this assumption in order to
prove that the operator $T$, acting on $\mathcal{S}$, preserves these properties, that is, $TV \in \mathcal{S}$. Now note that the
map defined by the operator $T$ in (11), acting on the complete metric space $\mathcal{S}$ of value functions,
with $T : \mathcal{S} \to \mathcal{S}$, is a strict contraction (see Theorem V.18 in the cited reference). Therefore, $T$ has a unique
fixed point which solves $TU = U$. On the other hand, $V$ is the unique solution to the same
(Bellman) equation in the space of all functions. As a result, $V = U$. Hence, if we start
the converging procedure with an initial function which satisfies (12) and (13), then by iteratively
applying the operator $T$ we end up with a solution which satisfies the aforementioned properties.

1) Proof of Lemma 1.1: Denote by $p_{k,i,j}$ the probability $p_{k,i}$, conditioned on the event that the largest
clique fully disjoint from the clique of size $k$, prior to the transmission, was of size $j$. Note that
$j \le k$. Denote the probability of having such a disjoint clique by $\hat{p}_{k,j}$ (by total probability,
$p_{k,i} = \sum_j p_{k,i,j}\, \hat{p}_{k,j}$).

By Equation (18) and Lemma 1.2 (see the end of this section) it holds that either $p_{k,i} = p_{k,i,0} + a_1 =
\binom{k}{i} p^i (1-p)^{k-i} + a_1$, for some nonnegative $a_1$, or $p_{k,i} = 0$. (Note that $a_1 = 0$ in the case there
were no other cliques of size $k - i$ prior to the encoded transmission.)

See that by multiple application of (12) and (13)

\[
\begin{aligned}
S(k) &\le \gamma V_0 + \gamma \sum_{i=0}^{k} i\, p_{k,i}\, d + (1-p)k \\
     &= \gamma V_0 + \gamma \sum_{i=0}^{k} i\, p_{k,i,0}\, d + (1-p)k + a_2(k) = \gamma V_0 + \gamma p k d + (1-p)k + a_2(k)
\end{aligned} \tag{14}
\]

and

\[
\begin{aligned}
S(k) &\ge \gamma V_k - \gamma \sum_{i=0}^{k} (k-i)\, p_{k,i}\, d + (1-p)k \\
     &= \gamma V_k - \gamma \sum_{i=0}^{k} (k-i)\, p_{k,i,0}\, d + (1-p)k - b_2(k) = \gamma V_k - \gamma (1-p) k d + (1-p)k - b_2(k)
\end{aligned} \tag{15}
\]

where $a_2(k)$ and $b_2(k)$ stand for summations of all compensation constants $a_1(k,i)$, in both cases above.

We use the contraction property in the remaining part of the proof. Since, by assumption, $V$ satisfies (12)
and (13), we only have to show that

\[ \max\{S(k+1), E(k+1)\} - \max\{S(k), E(k)\} \le d \tag{16} \]

\[ \max\{S(k-i), E(k-i)\} - \max\{S(k), E(k)\} \le -i + c \tag{17} \]


We analyze all the possible options within the curly brackets, as follows.

1.

\[ TV(k+1) - TV(k) = S(k+1) - S(k), \qquad TV(k-i) - TV(k) = S(k-i) - S(k). \]

Applying Lemma 1.3 it immediately follows that $TV(k+1) - TV(k) \le d$ and $TV(k-i) - TV(k) \le -i + c$
in this case.

2.

\[ TV(k+1) - TV(k) = E(k+1) - E(k), \qquad TV(k-i) - TV(k) = E(k-i) - E(k). \]

In order to prove the second case we should comply with the expressions for $d$ and $c$ found in the first case. Note
that $p^E_{k+1} \le p^E_k$. That is, the probability to increase the size of the maximal clique when acting by sending an empty
line decreases with the state size. Hence,

\[
\begin{aligned}
E(k+1) - E(k) &= p^E_{k+1}\gamma V(k+2) + (1 - p^E_{k+1})\gamma V(k+1) - p^E_k \gamma V(k+1) - (1 - p^E_k)\gamma V(k) \\
&= p^E_{k+1}\gamma V(k+2) + (1 - p^E_{k+1} - p^E_k)\gamma V(k+1) - (1 - p^E_k)\gamma V(k) \\
&= p^E_{k+1}\gamma \big[V(k+2) - V(k+1)\big] + (1 - p^E_k)\gamma V(k+1) - (1 - p^E_k)\gamma V(k) \\
&\le \gamma d\, p^E_{k+1} + d (1 - p^E_k)\gamma \le d\gamma \le d
\end{aligned}
\]

and

\[
\begin{aligned}
E(k-i) - E(k) &= p^E_{k-i}\gamma V(k-i+1) + (1 - p^E_{k-i})\gamma V(k-i) - p^E_k \gamma V(k+1) - (1 - p^E_k)\gamma V(k) \\
&= \big[p^E_{k-i}\gamma V(k-i+1) - p^E_{k-i}\gamma V(k-i)\big] + \gamma V(k-i) + \big[(1 - p^E_k)\gamma V(k+1) - (1 - p^E_k)\gamma V(k)\big] - \gamma V(k+1) \\
&\le \gamma d\, p^E_{k-i} + \gamma V(k-i) + (1 - p^E_k) d \gamma - \gamma V(k+1) \\
&\le \gamma d\, p^E_{k-i} + (1 - p^E_k) d \gamma - \gamma i - \gamma + \gamma c \le -i + c.
\end{aligned}
\]

See that for $\gamma$ close enough to 1 the last assertion is true.

3.

\[ TV(k+1) - TV(k) = S(k+1) - E(k), \qquad TV(k-i) - TV(k) = S(k-i) - E(k). \]

Using the proof of case 1:

\[ S(k+1) - E(k) \le S(k+1) - S(k) \le d, \qquad S(k-i) - E(k) \le S(k-i) - S(k) \le -i + c. \]

4.

\[ TV(k+1) - TV(k) = E(k+1) - S(k), \qquad TV(k-i) - TV(k) = E(k-i) - S(k). \]

Using the proof of case 2:

\[ E(k+1) - S(k) \le E(k+1) - E(k) \le d, \qquad E(k-i) - S(k) \le E(k-i) - E(k) \le -i + c. \]

There are additional combinations, such as $E(k+1) - S(k)$ and $S(k-i) - S(k)$, however their proof is
straightforward using the same considerations as above. It is trivially seen that all the cases hold for the boundary
conditions as well.

To see that $V(k)$ is non-decreasing in $k$ we use the following argument. Denote the aggregated state of
having a maximal clique of size $k$ as $s_k$, $s_k \in S$. Define the function $g_k : s_k \to s_{k-1}$, $k \ge 1$, such that for each
$s_k$, $g_k$ acts by deleting a random line from the maximal clique of size $k$, i.e., updating all entries of the chosen
line to 0. We aim to compare $V(s(k)) = V(k)$ and $V(g_k(s(k)))$. By a simple coupling argument one defines
two processes and sees that $V(s(k)) \ge V(g_k(s(k)))$. We skip the trivial details. Finally, the contraction property of
the operator $T$ follows from the well known results on MDP. See [40], for example. This accomplishes the proof of
the lemma. □

Lemma 1.2. For $j \ge 2$, that is, a disjoint clique exists,

\[ p_{k,i,j} = 0, \quad j > i; \qquad p_{k,i,0} \le p_{k,i,j}, \quad j \le i. \]

Proof. Trivially, in case a disjoint clique of size $j$ exists, the probability to have a clique smaller than $j$ is zero.
Therefore, the first assertion trivially holds, $p_{k,i,j} = 0$, $j > i$.
Next, see that for all $i$,

\[ p_{k,i,0} = \binom{k}{i} p^i (1-p)^{k-i}. \tag{18} \]

The sum of all transition probabilities from state $k$ acting $a = 1$, for all $j$, is 1:

\[ \sum_{i=0}^{k} p_{k,i,j} = 1. \]

Hence, the second assertion holds. ■

Lemma 1.3. One has constants $d$ and $c$ such that

\[ S(k+1) - S(k) \le d, \qquad S(k-i) - S(k) \le -i + c \]

for all $k$ and $i \le k$.


Proof. We prove by finding such constants. Substitute (fT2T i and ( fTSl l. using inequalities (fT4l i and ( fTSl l, and perform 
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algebraic simplifications. Write 

k k—1 

S{k) - S{k - 1) =-fJ2Pk,iyi + (1 + (1 -p)(^- 1) 

0 0 

< 7Vb + 'ykdp + (1 — p){k) + a2{k) — 'yVk-i + (1 — 'ld){l — p){k — 1 ) + b2{k — 1) 

< dkjk + d'jp — d'j — p + 1 + a2{k) + b2{k — 1) + {1 — k) j + cj < d 

and 

S{k — i) — S{k) < 7Vb + jpd{k — i) + (1 — p){k — i) + a2{k — i) — 7^4 + (1 — 'ld){l — p)k + b2{k) 

< —d'jip + d + k^ k + pi + jc — — k + a 2 {k — *) + b 2 {k) < —i + c 


Next, for simplicity, assume equalities for both inequalities above and write 

/ 

(i7 fc + d^p — (i7 — p+l + (l — fc)7 + c7 = c? 

—d'jip + d + kj k + pi + jc — 'yi — k + a 2 {k — i) — b 2 {k) 

= —i + c 

s. 

Solving for d and c we have the following expressions 

c = A(7 k + jp — j — l)b2{k) — A(j k + jp — j — 1)02(A: — i) 

+ A(^p{'y^ik — 7^1 + 7^fc — 7ifc — 7 fc + i)) ( 19 ) 

d = A'ya 2 {k — i) — A'yb 2 {k) + ^(7 ip — — ^k + jp — p + 1) (20) 

Where 1/A = j'^ip + 7 ^p — 7 ^ — 7 fc — 7 p + 1. Observe that 1/A i;: ip — fc as 7 —?► 1. The rightmost part of 
d in (I 20 I 1 is essentially independent of i and fc, and is less than 1 for all k,i- Consequently, the assumption d is 
independent of fc is plausible. One the other hand, c has very low positive values, comparatively to that of i. Hence, 
the constants d and c above satisfy the lemma. ■ 
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